The best alarm-propagation practices you could adopt for your EMS

  • Posted by Ramanathan
  • On April 22, 2021

Identifying faults and raising the relevant alarms are important aspects of an efficient fault management module in an Element Management System (EMS). The EMS needs to be intelligent enough to decide, based on the alarms it collects, how to notify users. Read on to find out the best alarm-propagation methods for an EMS.

Alarm representation for a group of devices

It is practically impossible for a service provider to monitor the status of each device (or its components) in the entire network from the network topology view (such as a map view). The problem is only compounded as the size of the network grows. In order to address this, the service provider typically groups devices in the following ways, which vastly simplifies alarm monitoring and management:

  1. Location-based grouping
  2. Hierarchy-based grouping

However, the grouped management model does not by itself expose the fault conditions that might be present in individual network entities at the global network-view level. The administrator may not have complete visibility into the health of the network devices, because lower-level alarms (such as alarms from a port) tend to get lost in the grouping or aggregation process. A methodology or process is needed that notifies administrators of network faults at the global network level, without overwhelming them or making the fault difficult to discover.

Alarm propagation

The solution is an EMS with a built-in alarm propagation feature that would automatically bump any alarm raised at the lowest level up to the next level. This “bumping” would happen until the alarm is displayed at the global network level. Propagating alarms from the port or card level to the device level makes it easier for the administrator to determine the health of the network. It is enough to monitor the network’s devices instead of its hundreds of ports.

Alarm propagation is a process through which alarms at lower levels (like the port or card level) are aggregated and represented to the administrator in an intelligent manner. The goal is to quickly notify the administrators about the network fault conditions. Alarm propagation can be implemented in different ways: based on the deployment topology or the service offering.

Containment hierarchy

A common alarm-propagation approach involves the containment hierarchy. For devices with a containment hierarchy (such as shelf, card, and port), the card status will be represented based on the aggregated status of all the ports in that card. So, if there is a major alarm in one of the ports and a critical alarm in another port, the card status will be represented as critical (the maximum alarm severity of all ports). Also, the alarm raised at the port level is propagated to the card, the shelf, and finally to the device.
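The "maximum severity wins" rule described above can be sketched as follows. This is a minimal illustration, not an actual EMS implementation: the `Severity` scale, the `Node` class, and its field names are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    # Hypothetical severity scale; a real EMS may define its own levels.
    CLEAR = 0
    WARNING = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


@dataclass
class Node:
    """One element of the containment hierarchy (port, card, shelf, device)."""
    name: str
    own_severity: Severity = Severity.CLEAR
    children: List["Node"] = field(default_factory=list)

    def status(self) -> Severity:
        # A node's status is the worst severity among itself and its subtree.
        return max([self.own_severity] + [child.status() for child in self.children])


# One port with a major alarm and another with a critical alarm.
card = Node("card-1", children=[
    Node("port-1", Severity.MAJOR),
    Node("port-2", Severity.CRITICAL),
])
shelf = Node("shelf-1", children=[card])
device = Node("device-1", children=[shelf])

print(card.status().name)    # CRITICAL
print(device.status().name)  # CRITICAL
```

The card, the shelf, and the device all report critical, because the worst port alarm dominates at every level of the hierarchy.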

Location-based aggregation

Another common propagation model used by service providers is based on the location of deployment (normally used in multi-site deployments). In this model, alarms from the devices are propagated to the site of deployment. These alarms are then propagated to the county, city, state, and country levels. This model provides network administrators with a clear view of problematic areas on a global map. It is not uncommon to find implementations that use a combination of containment hierarchy and location-deployment approaches.
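As a rough sketch of location-based aggregation, each alarm can carry the location path of its site, and the worst severity can be rolled up to whichever level of the map is being viewed. The function name, the severity labels, and the sample locations below are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical severity ranking; higher means worse.
SEVERITY_RANK = {"clear": 0, "warning": 1, "minor": 2, "major": 3, "critical": 4}

Location = Tuple[str, ...]  # e.g. (country, state, city, site)


def rollup_by_location(
    alarms: Iterable[Tuple[Location, str]], depth: int
) -> Dict[Location, str]:
    """Return the worst severity per location, truncated to `depth` levels."""
    worst: Dict[Location, str] = {}
    for location, severity in alarms:
        key = location[:depth]
        if key not in worst or SEVERITY_RANK[severity] > SEVERITY_RANK[worst[key]]:
            worst[key] = severity
    return worst


alarms = [
    (("US", "CA", "San Jose", "site-1"), "major"),
    (("US", "CA", "Fresno", "site-7"), "critical"),
    (("US", "TX", "Austin", "site-3"), "minor"),
]

print(rollup_by_location(alarms, depth=2))
# {('US', 'CA'): 'critical', ('US', 'TX'): 'minor'}
print(rollup_by_location(alarms, depth=1))
# {('US',): 'critical'}
```

Changing `depth` corresponds to zooming the map in or out: the same underlying alarms produce a state-level or a country-level picture.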

What are the issues with the alarm-propagation model?

Consider an EMS that was implemented using a traditional alarm-propagation model: an alarm at the port level would be propagated to the card, shelf, device, and finally to the network. Since this process involves recursive escalations, the EMS might experience the following shortcomings:

  • For each low-level alarm, the propagation mechanism generates several alarms – which in turn increases the total number of events that the EMS needs to handle. This could potentially impact the overall performance.
  • The propagated alarm may not get immediately reflected when there are quite a few alarms in the event queue of the parent.
  • Since the propagation is recursive and driven by an event queue, the lower-level alarm may already have been cleared by the time it has been propagated to the highest level. The alarm’s clearance then has to follow the same recursive path, which delays an accurate reflection of the device’s status.
  • When too many events get raised for escalation, it might also affect the time to process new alarms.
  • To drill down to the actual alarm, the administrator has to go through several layers, which is both annoying and time-consuming.
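To make the first shortcoming concrete, the sketch below simulates a queue-driven recursive escalation and counts how many events the EMS ends up handling. The level names follow the example above; the function and the single-queue model are simplifying assumptions, not a description of any particular product.

```python
from collections import deque

# Levels of the hierarchy, lowest first, as in the example above.
LEVELS = ["port", "card", "shelf", "device", "network"]


def recursive_propagation_events(port_alarms: int) -> int:
    """Count events handled when every port alarm is re-raised at each parent level."""
    queue = deque((0, alarm_id) for alarm_id in range(port_alarms))
    handled = 0
    while queue:
        level_index, alarm_id = queue.popleft()
        handled += 1
        # Escalate: enqueue a derived alarm for the parent level, if there is one.
        if level_index + 1 < len(LEVELS):
            queue.append((level_index + 1, alarm_id))
    return handled


print(recursive_propagation_events(100))  # 500
```

Under these assumptions, 100 port alarms turn into 500 handled events, and each clearance would add the same number again. Network-level events are also processed last, behind everything queued before them, which mirrors the delayed status updates noted in the list.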

The Fix

The key requirement for the propagation scheme is to make a low-level alarm visible to the administrator (at the global network level) without overloading the administrator or making the alarm difficult to find. This can be achieved by propagating the alarm graphically from the lower level to the highest level without going through successive levels of recursion, which can significantly lower the overhead of an EMS.

For instance, if the administrator is viewing the entire hierarchy (port, card, shelf, device, and network), then it doesn’t make sense to propagate a port-level alarm to the shelf level. But if the administrator is concerned only with the device-level view of the network and is not interested in drilling down any further, then they should be able to see a lower-level alarm (like a port-level alarm) at the device level. In this case, it is reasonable to propagate the port-level alarm to the shelf and finally to the device. So instead of raising repeated alarms, the device status is defined as an aggregation of the alarm conditions of all its components, taking into account the view chosen by the administrator.
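One way to picture this is a store that keeps only the original alarms and computes the status of whatever the administrator is looking at on demand. The sketch below is illustrative only: the `AlarmStore` class, its method names, the path tuples, and the integer severities (0 for clear up to 4 for critical) are assumptions, not the design of any specific EMS.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]  # e.g. ("device-1", "shelf-1", "card-2", "port-7")


class AlarmStore:
    """Keeps only the original alarms; no derived alarms are raised for parents."""

    def __init__(self) -> None:
        self._active: Dict[Path, int] = {}

    def raise_alarm(self, path: Path, severity: int) -> None:
        self._active[path] = severity

    def clear_alarm(self, path: Path) -> None:
        self._active.pop(path, None)

    def status_at(self, view: Path) -> int:
        # Worst severity among active alarms located under the viewed element.
        return max(
            (sev for path, sev in self._active.items() if path[: len(view)] == view),
            default=0,
        )

    def sources(self, view: Path) -> List[Path]:
        # The original alarms behind a status, reachable in one step.
        return sorted(path for path in self._active if path[: len(view)] == view)


store = AlarmStore()
store.raise_alarm(("device-1", "shelf-1", "card-2", "port-7"), 4)
store.raise_alarm(("device-1", "shelf-1", "card-3", "port-1"), 3)

print(store.status_at(("device-1",)))   # 4
store.clear_alarm(("device-1", "shelf-1", "card-2", "port-7"))
print(store.status_at(("device-1",)))   # 3
print(store.sources(("device-1",)))
# [('device-1', 'shelf-1', 'card-3', 'port-1')]
```

Because the status is derived at query time, a clearance shows up at every level as soon as the original alarm is removed, and the source alarm can be listed directly from the device-level view without stepping through each layer.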

Compared to the old approach, this new one offers much better performance. Since the new approach eliminates the considerable delay in representing a fault condition, it is easier for the administrator to diagnose a problem.

Does your EMS come equipped with such fixes? It is always better to get the best-in-class EMS for your devices to have a hassle-free experience. Make sure to check out NetMan, an EMS powered by DMS Technology, which comes packed with all the critical features. With more than 16 years of experience, we assure you that you will be in good hands.
