In summary, the Service Level Agreement is customer-facing and supports the services offered by the IT department. The Operating Level Agreement permits the various teams, groups, and suppliers to work together cohesively to deliver the IT services in support of the SLA.
Service Level Agreements (SLAs), which define the quality attributes (QoS, Quality of Service) and guarantees a service is required to possess, are of growing commercial interest and have a deep impact on strategic and organizational processes, as many research studies and the intensified interest in accepted management standards such as ITIL v4 show. They are used in all areas of IT, ranging from hosting and communication services to help desk and problem resolution. A well-defined and effective SLA correctly fulfills the expectations of all participants and provides metrics for accurately measuring performance against the guaranteed Service Level (SL) objectives. During the monitoring and enforcement phase, the defined metrics are used to detect violations of the promised SLs and to derive consequential activities in terms of rights and obligations. The metrics play a key role in metering, accounting and reporting, and provide data for further analysis and refinement of SLAs in the analysis phase. SLA metrics are drawn from a variety of disciplines, such as business process management, service and application management, or traditional systems and network management.
Different organizations define crucial IT parameters such as Availability, Throughput, Downtime and Response Time differently. For example, some focus on the infrastructure (TCP connections) to define service availability, while others refer to the service application (the ability to access the service application). Ambiguity, unfulfilled expectations and problems during the fulfilment of SLAs are the result. A poor choice of metrics results in SLAs that are difficult to enforce automatically and may motivate the wrong behaviour. Currently, practitioners have almost no support in selecting appropriate metrics for implementing successful SLAs (in terms of automation and compliance with the service objects and IT management processes) so that service performance can be gauged automatically. This paper does not attempt to define an exhaustive list of metrics that should be included in an SLA: the topic is too large given the enormous number of potential metrics, and, as noted above, it varies from organization to organization and from service to service. We propose a general categorisation scheme for typical metrics for basic service objects and IT management processes and populate it with metrics that commonly appear in SLAs. The metrics are derived from industrial requirements, i.e. they are taken from SLAs currently in use, in an effort to provide realistic terms that are both useful and usable, in particular for the automation of SLAs. To our knowledge, this is a first-of-a-kind approach; a multi-dimensional categorization of SLA contents and metrics is missing in the literature. The contribution of the categorization is manifold. It supports SLA engineers in their design decisions, in particular concerning the specification of SLAs that are intended to be monitored and enforced automatically.
During execution time, it might contribute to root cause analysis by identifying problems such as infrastructure instability, low performance levels of service objects, or poorly designed, critical IT processes for which responsible persons can be derived. Furthermore, it might be used to analyse existing SLAs, indicating the extent to which an SLA is already oriented towards ITIL and whether there is potential for improvement.
Service Level Agreements

This section gives an insight into Service Level Agreements and IT service contracts in general. It categorizes different types of service contracts, presents their main components and defines their goals in order to reach a common understanding. We start with the definitions of some terms used throughout the paper:

• SLA metrics are used to measure the performance characteristics of the service objects. They are either retrieved directly from the managed resources, such as servers, middleware or instrumented applications, or are created by aggregating such direct metrics into higher-level composite metrics. Typical examples of direct metrics are the MIB variables of the IETF Structure of Management Information (SMI), such as number of invocations, system uptime or outage period, and technical network performance metrics such as loss, delay and utilization, which are collected via measurement directives such as management interfaces, protocol messages, URIs etc. Composite metrics use a specific function that averages one or more metrics over a specific amount of time, e.g. average availability, or breaks them down according to certain criteria, e.g. maximum response time, minimum throughput, top 5%, etc.
• Service Levels and Guarantees, a.k.a. SLA rules, represent the promises and guarantees with respect to graduated high/low ranges, e.g. average availability range [low: 95%, high: 99%, median: 97%], so that it can be evaluated whether the measured metrics exceed, meet or fall below the defined service levels at a certain point in time or in a certain validity period. They can be informally represented as if-then rules, which might be chained in order to form graduations, complex policies and conditional guarantees, e.g. conditional rights and obligations with exceptions, violations and consequential actions: “If the average service availability during one month is below 95% then the service provider is obliged to pay a penalty of 20%.”
• IT Management Processes / ITIL Processes are IT management processes defining common practices in areas such as Incident, Problem, Configuration, Change or Service Level Management.
• SLA (Service Level Agreement): An SLA is a document that describes the performance criteria a provider promises to meet while delivering a service. It typically also sets out the remedial actions and any penalties that will take effect if performance falls below the promised standard. It is an essential component of the legal contract between a service consumer and the provider.

According to the Hurwitz Group, the life cycle of an SLA is defined as follows:

1. SLA Design
2. Assign SLA owner
3. Monitor SLA compliance
4. Collect and analyze data
5. Improve the service provided
6. Refine SLA

Fig. 1 SLA life cycle [St00]

The objectives of SLAs are manifold. In a nutshell, the substantial goals are [Pa04]:

• Verifiable, objective agreements
• Known risk distribution
• Trust and reduction of opportunistic behavior
• Fixed rights and obligations
• Support of short and long term planning and further SLM processes
• Decision Support: Quality signal (e.g. assessment of the new market participants)

According to their intended purpose, their scope of application or their versatility, SLAs can be grouped into different (contract) categories, e.g.

Table 1: SLA categorization

Intended Purpose
- Basic Agreement: Defines the general framework for the contractual relationship and is the basis for all subsequent SLAs, including the severability clause.
- Service Agreement: Subsumes all components which apply to several subordinated SLAs.
- Service Level Agreement: Normal Service Level Agreement
- Operation Level Agreement (OLA): A contract with internal operational partners who are needed to fulfill a superior SLA.
- Underpinning Contract (UC): A contract with an external operational partner who is needed to fulfill a superior SLA.

Scope of Application (according to
- Internal Agreement: An informal agreement rather than a legal contract
- In-House Agreement: Between internal departments or divisions
- External Agreement: Between the service provider and an external service consumer
- Multi-tiered Agreement: Including third parties, up to a multitude of parties

Versatility (according to [Bi01])
- Standard Agreement: Standard contract without special agreements
- Extensible Agreement: Standard contract with additional specific agreements
- Individual Agreement: Customized, individual agreements
- Flexible Agreement: Mixture of standard and individual contract

A particular service contract might belong to more than one category, e.g. an Operation Level Agreement (OLA) might also be an individual in-house agreement. Several service contracts can be organized in a unitized structure according to a taxonomical hierarchy:

Service Level Agreements come in several varieties and comprise different technical, organizational or legal components. Table 2 lists some typical contents.

Table 2: Categorization of SLA contents

Technical Components
- Service Description
- Service Objects
- SLA/QoS Parameter
- Metrics
- Actions
- …

Organizational Components
- Liability and liability limitations
- Level of escalation
- Maintenance / Service periods
- Monitoring and Reporting
- Change Management
- …

Legal Components
- Obligations to co-operate
- Legal responsibilities
- Proprietary rights
- Modes of invoicing and payment
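The definitions of composite metrics and SLA rules given at the start of this section can be made concrete with a minimal sketch. It assumes that the direct metric is a list of daily availability percentages, that the composite metric is their plain average over one month, and that the rule is the example quoted above (a 20% penalty if the monthly average falls below 95%). The function names and the measurement data are hypothetical and serve illustration only.

```python
from statistics import mean


def average_availability(daily_availability_pct):
    """Composite metric: average the direct daily measurements over the period."""
    return mean(daily_availability_pct)


def penalty_rate(avg_availability_pct, threshold_pct=95.0, penalty=0.20):
    """SLA rule as an if-then rule: below the threshold, a penalty is due."""
    if avg_availability_pct < threshold_pct:
        return penalty
    return 0.0


# Hypothetical direct measurements for a 30-day month (percent per day).
good_month = [99.0] * 30
bad_month = [99.0] * 20 + [80.0] * 10

for name, month in (("good", good_month), ("bad", bad_month)):
    avg = average_availability(month)
    print(f"{name}: average availability {avg:.2f}%, penalty rate {penalty_rate(avg)}")
```

In this sketch the good month incurs no penalty, while the bad month averages below 95% and triggers the 20% penalty. Graduated service levels could be expressed by chaining several such rules with different thresholds.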
SLA Metrics

In order to develop a useful categorization scheme for IT metrics, we have spoken to close to three dozen IT service providers, from small and medium-sized enterprises to big companies, and we have analyzed nearly fifty state-of-the-art SLAs currently used throughout the industry in the areas of IT outsourcing, Application Service Provisioning (ASP), Hardware Hosting, Service Suppliers and many others. One of the biggest problems we identified is the lack of rapport between metrics and service objects/IT processes, as well as the lack of automation in SLA management and monitoring, which is directly influenced by the underlying metrics and their ability to be automated. Based on this observation, we use three major categories to structure the field of SLA metrics: the service objects under consideration, ITIL processes and automation grade. The first category distinguishes basic service objects such as hardware, software, network etc. Composite metrics such as end-to-end availabilities can be broken down into smaller direct metrics which are assigned to one of these basic object types. The second category is organized around the eleven ITIL management processes. This leads to clear responsibilities and procedures, and the metrics might reveal potential for process optimization. The last category deals with the question of measurability and therefore implicitly with the automation of metrics. It helps to find “easy-to-collect” metrics and to identify problematic SLA rules in existing SLAs, i.e. rules with metrics which can be measured manually only or which cannot be measured at all.
In a nutshell, each category gives answers to different questions relating to the design, implementation and analysis of SLAs, such as “Which metrics can be used for a particular service object?”, “Can the metric be automatically measured and what are the possible units?” or “Does a particular SLA sufficiently support the ITIL processes or is there improvement potential in terms of missing metrics?” etc. Furthermore, the combination of the categories helps to identify dependencies between the SLA resources used and the performance of management processes and management tools.
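The three questions above can be illustrated with a small sketch of a metric catalog indexed by the three categories. This is not the paper's scheme itself: the class names, the catalog entries and their assignment to service objects, ITIL processes and measurability grades are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceObject(Enum):
    HARDWARE = "Hardware"
    SOFTWARE = "Software"
    NETWORK = "Network"
    STORAGE = "Storage"
    HELP_DESK = "Help Desk"


class Measurability(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual only"
    NONE = "not measurable"


@dataclass(frozen=True)
class Metric:
    name: str
    service_object: ServiceObject
    itil_process: str
    measurability: Measurability
    unit: str


# Illustrative entries only; every assignment here is an assumption.
CATALOG = [
    Metric("availability", ServiceObject.HARDWARE,
           "Service Level Management", Measurability.AUTOMATIC, "%"),
    Metric("response time", ServiceObject.SOFTWARE,
           "Service Level Management", Measurability.AUTOMATIC, "ms"),
    Metric("packet loss", ServiceObject.NETWORK,
           "Incident Management", Measurability.AUTOMATIC, "%"),
    Metric("customer satisfaction", ServiceObject.HELP_DESK,
           "Incident Management", Measurability.MANUAL, "score"),
]


def metrics_for(service_object):
    """Which metrics can be used for a particular service object?"""
    return [m for m in CATALOG if m.service_object is service_object]


def problematic(metrics):
    """Which metrics cannot be measured automatically?"""
    return [m for m in metrics if m.measurability is not Measurability.AUTOMATIC]


def uncovered_processes(metrics, processes):
    """Which of the given ITIL processes have no metric in the SLA?"""
    covered = {m.itil_process for m in metrics}
    return [p for p in processes if p not in covered]


print([m.name for m in metrics_for(ServiceObject.NETWORK)])
print([m.name for m in problematic(CATALOG)])
print(uncovered_processes(CATALOG, ["Incident Management", "Change Management"]))
```

Keeping the three categories as independent fields of one record lets each question be answered by a simple filter, and combining filters exposes the dependencies between service objects and processes mentioned above.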
Categorization according to Service Objects

Although the particular end-to-end service objects may differ considerably among SLAs, they can mostly be reduced to five basic IT object classes, namely: Hardware, Software, Network, Storage and Help Desk (a.k.a. Service Desk). The respective instances can be combined in any combination to form complex and compound services, such as ASP solutions including servers (hardware), applications such as SAP (software), databases or data warehouses (storage) and support (help desk). Each object class has its own set of typical quality metrics. In the following, we present useful metrics in each class and give examples of their units.