Autonomous Clouds and Networks

Vision for the Cluster

An increasing share of computing and information services is moving to the cloud, where the services execute on virtualized hardware in private or public data centers. Hence, the cloud can be viewed as an underlying computing infrastructure for all systems of systems. The architectural complexity of the cloud is rapidly increasing. Modern data centers consist of tens of thousands of components, e.g., compute servers, storage servers, cache servers, routers, PDUs, UPSs, and air-conditioning units, with configuration and tuning parameters numbering in the hundreds of thousands. There is also an ongoing development of more modular computing nodes, often called disaggregated or rack-scale systems, where the resources (compute, memory, network, etc.) of a large set of servers are treated as large shared pools.

The same trend holds for the operational complexity. The individual components are themselves increasingly difficult to maintain and operate. The strong coupling between the components furthermore makes it necessary to tune the system as a whole, which is complicated by the fact that in many cases the behaviors, execution contexts, and interactions are not known a priori. The term autonomous computing, or autonomic computing, was coined by IBM in the early 2000s for self-managing computing systems, with a focus on private enterprise IT systems. The approach is, however, even more relevant for the cloud: current levels of scale, complexity, and dynamicity make efficient human management infeasible. In the autonomous cloud, control, AI, and machine-learning/analytics techniques will be used to dynamically determine how applications are best mapped onto the server network, how capacity should be automatically scaled when the load or the available resources vary, and how load should be balanced.
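
As a concrete, if simplified, illustration of such control-based automation, the sketch below scales the number of replicas of a service so that measured utilization tracks a target; the target value, replica bounds, and numbers are illustrative assumptions rather than part of any specific platform.

    import math

    TARGET_UTIL = 0.6                    # assumed target utilization per replica
    MIN_REPLICAS, MAX_REPLICAS = 1, 100

    def autoscale_step(current_replicas, measured_util):
        """One feedback step: keep capacity roughly proportional to observed load."""
        desired = math.ceil(current_replicas * measured_util / TARGET_UTIL)
        return max(MIN_REPLICAS, min(MAX_REPLICAS, desired))

    # Example: utilization rises to 0.9 on 4 replicas -> scale out to 6 replicas.
    print(autoscale_step(4, 0.9))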

Currently there is also a growing interest in applying cloud techniques, such as virtualization and co-location, in the telecommunication access network itself. The unification of the telecom access network and the traditional cloud data centers, sometimes referred to as the distributed cloud or edge cloud, provides a single distributed computing platform. Here the boundary between the network and the data centers disappears, allowing application software to be dynamically deployed in all types of nodes: in edge nodes such as base stations close to end-users, in remote large-scale data centers, or anywhere in between. In these systems the need for autonomous operation and resource management becomes even more urgent, as heterogeneity increases, some of the nodes may be mobile with varying availability, and new 5G-based mission-critical applications with stricter requirements on latency, uptime, and availability are migrated to the cloud.

Research Challenges

In the cluster, distributed control and real-time analytics will be used to dynamically solve resource management problems in the distributed cloud. The management problem consists of deciding the types and quantities of resources that should be allocated to each application, and when and where to deploy them. This also includes dynamic decisions such as automatic scaling of the resource amount when the load or the available resources vary, and on-line migration of application components between nodes. Major scientific challenges include:

  • How to create models for the distributed cloud infrastructure, useful for both system design and optimization of resource and application management?
  • How to model and predict workloads, including variations in both time and locality? (A minimal forecasting sketch is given after this list.)
  • How to perform on-line distributed analytics for creating the dynamically maintained knowledge base needed for optimization of resource and application management?
  • How to best use model-based feedback control and distributed control for controlling both system throughput and end-user response times?
  • How to design an autonomous management system that dynamically maintains a sufficient degree of decentralization to handle the scale while still being able to make highly optimized management decisions?
  • How to perform, in such a system, the most fundamental management optimizations, such as vertical and horizontal capacity scaling, geo-placement of application components, and service differentiation and management?
  • How to design and manage a distributed cloud to meet the requirements of mission critical applications?
  • How to integrate and tailor such an autonomous system to intrinsically distributed and dynamic applications with massive data producers, such as video cameras for surveillance and supervision?
  • How to perform resource management and orchestration in disaggregated hardware architectures?
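
To make the workload-modeling challenge concrete, the following minimal sketch forecasts near-future load with double exponential smoothing (Holt's method); the smoothing constants and the synthetic request-rate trace are illustrative assumptions, not results from the cluster.

    def holt_forecast(series, alpha=0.5, beta=0.3, horizon=1):
        """Double exponential smoothing: track level and trend, then extrapolate."""
        level, trend = series[0], series[1] - series[0]
        for x in series[1:]:
            prev_level = level
            level = alpha * x + (1 - alpha) * (level + trend)
            trend = beta * (level - prev_level) + (1 - beta) * trend
        return level + horizon * trend

    # Synthetic trace of requests per second over the last few intervals.
    trace = [100, 110, 125, 138, 150, 165]
    print(round(holt_forecast(trace, horizon=2)))   # predicted load two steps ahead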

The interdependencies between these questions, and the interactions expected in order to solve them, can be illustrated as follows. In order to develop efficient methods for resource management, it is crucial to understand the performance aspects of the infrastructure, what the workloads look like, and how they vary over time. Due to user mobility and variations in usage and resource availability, applications running many instances are constantly subject to change: the number of instances grows and shrinks, individual instances are relocated or resized, network capacity is adjusted, and so on. Hence, infrastructure modeling and workload modeling for the distributed cloud are fundamental. On-line analytics and learning based on extreme amounts of monitoring data can create the knowledge needed for autonomous management. Capacity autoscaling is needed to determine how much capacity should be allocated to a complete application or to any specific part of it, and dynamic geo-placement complements this by determining when, where, and how instances should be relocated, e.g., from a data center to a specific base station. Since not all applications are equally important, e.g., due to differently priced service levels or because some are critical to society (emergency, health care, etc.), all management must take Quality of Service differentiation into account, and for mission-critical applications there is also a need to intelligently provide redundancy and fall-back solutions. The management systems themselves need to be capable of handling systems of extreme scale, and of handling both the highly distributed infrastructure and the individual nodes, including so-called rack-scale systems.
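
As a small sketch of the geo-placement decision described above, the code below picks, for each application component, the cheapest site whose estimated latency to the component's users stays within its budget; the site names, latencies, and costs are invented for illustration.

    def place_component(latency_budget_ms, sites):
        """Pick the cheapest site that satisfies the component's latency budget."""
        feasible = [s for s in sites if s["latency_ms"] <= latency_budget_ms]
        if not feasible:
            return None   # no feasible site: reject or relax the requirement
        return min(feasible, key=lambda s: s["cost"])

    # Hypothetical candidate sites, from a base-station edge node to a remote DC.
    sites = [
        {"name": "edge-basestation-17", "latency_ms": 3,  "cost": 9.0},
        {"name": "metro-dc-stockholm",  "latency_ms": 12, "cost": 4.0},
        {"name": "remote-dc-north",     "latency_ms": 45, "cost": 1.5},
    ]
    print(place_component(20, sites)["name"])   # -> metro-dc-stockholm
    print(place_component(5, sites)["name"])    # -> edge-basestation-17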

Industrial Challenges

Sweden has a long industrial tradition in telecommunication, and recently there has been rapid growth in the cloud industry, from large-scale data center establishments to start-ups providing novel cloud services. Most industries, however, have difficulties in addressing the ambitious long-term basic research challenges required to carry out truly game-changing technology shifts. We believe our proposed research will provide great opportunities for such industries. The cluster has particularly strong connections to Ericsson Research, which states that the cluster “addresses important scientific problems that have to be solved in order to establish the distributed cloud with dependable performance, low latency, and minimal environmental impact. This is an important enabler for the networked society.” Once the autonomous cloud is in place, however, it will also open up new applications for many other industry sectors, e.g., process automation and cloud robotics, where part of the optimization-based computations can be moved into the cloud, and automated transport systems, where fleet management operations as well as individual vehicle optimizations can be performed in the cloud. The cluster connects to several of the other WASP clusters in different ways. An immediate connection is that cloud technology is of interest in almost all of the clusters as a way of implementing the compute-intensive parts of their systems. Examples of this are the two clusters “Automated Transport Systems” and “Interaction and Communication with Sensor-Rich Autonomous Agents”. The second, and most important, connection is that other WASP clusters may provide traffic with the requirements that the distributed edge cloud is designed to support, i.e., low latency, high bandwidth, mobility, and dynamically changing QoS requirements. This could be the case, e.g., for “Automated Transport Systems” and for “Perception and Learning in Interactive Autonomous Systems”.

Sub-projects

The personnel include the following researchers:

  • Erik Elmroth, (cluster coordinator), Umeå University
  • Karl-Erik Årzén, Lund University
  • Maria Kihl, Lund University
  • Dejan Kostic, KTH
  • Martina Maggio, Lund University
  • Philipp Leitner, Chalmers
  • Anton Cervin, Lund University
  • Marina Papatriantafilou, Chalmers
  • Johan Eker, Lund University

The PhD student subprojects are the following:

Testing Autonomous Control-Based Software Systems
Claudio Mandrioli (PhD student, Lund University), Martina Maggio (advisor, Lund University), Karl-Erik Årzén (co-advisor, Lund University)

Self-adaptive software usually comprises the software itself and an adaptation layer in charge of observing the current execution conditions and reacting to them with changes in the software's behavior. The adaptation layer is often realized with control-theoretical techniques, to exploit the large set of guarantees that control-based adaptation provides. Properly testing these systems is a complex problem. First, the control strategy should be verified on its own to assess the formal guarantees that it entails. Second, it should be possible to verify that the introduction of control theory does not influence the behavior of the software in terms of functional properties. Third, the formal guarantees that the control-theoretical adaptation offers should be verified in practice when the controller is connected to the software system. The project proposes the study of testing for self-adaptive software where the adaptation layer is based on control-theoretical principles.
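
As a minimal sketch of the third point, the code below attaches a simple feedback controller to a toy software model (an M/M/1-style response-time curve) and tests that the closed-loop system converges to its response-time set-point; the plant model, gain, and tolerance are illustrative assumptions, not the project's actual testing methodology.

    def simulate_adaptation(steps=200, setpoint=0.2, gain=0.5, load=50.0):
        """Controller adjusts capacity so that response time tracks the set-point."""
        capacity, history = 60.0, []
        for _ in range(steps):
            # Toy plant: M/M/1-like response time 1 / (capacity - load).
            rho = min(load / capacity, 0.99)
            resp = (1.0 / capacity) / (1.0 - rho)
            capacity = max(1.0, capacity + gain * (resp - setpoint) * capacity)
            history.append(resp)
        return history

    def test_converges_to_setpoint():
        """Test oracle: the steady-state error should be small."""
        tail = simulate_adaptation()[-20:]
        assert all(abs(r - 0.2) < 0.05 for r in tail), "adaptation did not converge"

    test_converges_to_setpoint()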

 

Autonomous learning camera systems in resource constrained environments
Alexandre Martins (Industrial PhD student, Lund University, Axis Communications), Mikael Lindberg (advisor, Axis Communications), Karl-Erik Årzén (advisor, Lund University), Martina Maggio (co-advisor, Lund University)

The future networked society will contain a huge number of devices, many of them processing very large amounts of sensor data. One example is distributed video cameras in surveillance and supervision applications. Due to efficiency and price constraints, the communication and computing platforms are often limited, so dynamic resource management is required. This project aims to turn camera systems into a swarm of autonomous scene-learning devices that share the same resources, turning today's central server into a viewing-only client. The system will make sure that the available resources are dynamically and optimally allocated at all times. The swarm will be completely flexible, allowing devices to be added or removed and resources to be re-allocated accordingly. Each device will communicate with its surroundings and in the process learn situation-specific parameters, such as resource availability and expenditure, scene properties, etc., in order to predict future resource needs and allow for superior system-wide resource management.
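
A small sketch of the kind of swarm-level negotiation this implies is given below: each camera reports a predicted resource need, and a shared pool is divided in proportion to those predictions when demand exceeds supply; the device names and numbers are purely illustrative.

    def share_pool(total_capacity, predicted_needs):
        """Split a shared resource pool in proportion to each device's predicted need."""
        total_need = sum(predicted_needs.values())
        if total_need <= total_capacity:
            return dict(predicted_needs)        # everyone gets what it asked for
        scale = total_capacity / total_need     # otherwise scale down proportionally
        return {dev: need * scale for dev, need in predicted_needs.items()}

    # Hypothetical example: three cameras sharing a 100 Mbit/s uplink.
    print(share_pool(100, {"cam-entrance": 60, "cam-parking": 40, "cam-lobby": 20}))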

 

Autonomous control of a network of smart services
Victor Millnert (PhD student, Lund University), Johan Eker (advisor, Lund University, Ericsson Research), Karl-Erik Årzén (co-advisor, Lund University)

This research project revolves around the question of how to control a network of smart services so that applications requiring low and predictable end-to-end latencies can be hosted on it. To achieve this, we develop new models and theory for controlling the processing capacity and the admission control of the smart services. Along with this, we investigate how to change the configurations of the smart services in order to allow applications to join and leave the network in a dynamic manner, without sacrificing the predictability of the performance.
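
A minimal sketch of the capacity and admission-control reasoning, assuming each service stage can be approximated as an M/M/1 queue, is given below; the per-stage capacities, rates, and latency budget are invented for illustration.

    def chain_latency(arrival_rate, capacities):
        """End-to-end mean latency of a service chain, M/M/1 approximation per stage."""
        total = 0.0
        for mu in capacities:
            if arrival_rate >= mu:
                return float("inf")             # a stage is overloaded
            total += 1.0 / (mu - arrival_rate)
        return total

    def admit(arrival_rate, extra_rate, capacities, budget):
        """Admission control: accept the new flow only if the latency budget still holds."""
        return chain_latency(arrival_rate + extra_rate, capacities) <= budget

    # Hypothetical three-stage chain with a 50 ms end-to-end budget.
    caps = [400.0, 350.0, 500.0]             # requests/s each stage can process
    print(chain_latency(300.0, caps))         # current mean latency in seconds
    print(admit(300.0, 40.0, caps, 0.050))    # can 40 more requests/s be accepted?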

 

Autonomous Resource Management in Mobile Edge Cloud
Chanh Nguyen (PhD student, Umeå University), Erik Elmroth (advisor, Umeå University), Cristian Klein (co-advisor, Umeå University).

The emerging Mobile Edge Cloud platform, in which the traditional centralized cloud is complemented by IT capability distributed in the proximity of end-users, helps meet the ultra-low latency and high bandwidth requirements of IoT and mobile applications. However, the intertwined challenges of user mobility, dense distribution, and heterogeneous resource capacity significantly complicate resource allocation in this computing paradigm, where the aim is to optimize cost and energy efficiency while guaranteeing the expected user experience. An efficient resource allocation system must therefore not only understand and predict workload dynamics in terms of increasing and decreasing load, but also understand and predict the mobility of the load.

Our research aims to design an autonomous management system for the Mobile Edge Cloud that addresses the aforementioned challenges. The goal is a self-adaptive, self-optimizing management system that continuously (1) monitors system and application behavior and predicts near-future changes through machine learning and data analytics, (2) optimizes resource allocation through optimization and feedback control, and (3) learns from the impact of its own behavior.

 

Control-based Resource Management in the Cloud
Tommi Nylander (PhD student, Lund University), Karl-Erik Årzén (advisor, Lund University), Maria Kihl (co-advisor, Lund University), Martina Maggio (co-advisor, Lund University)

In this project, control and real-time analytics are used to dynamically solve resource management problems in the distributed cloud. The management problem consists of deciding the types and quantities of resources that should be allocated to each application, and when and where to deploy them. This also includes dynamic decisions such as automatic scaling of the resource amount when the load or the available resources vary, and on-line migration of application components between nodes. Major scientific challenges include dynamic modeling of cloud infrastructure resources and workloads, how to best integrate real-time analytics techniques with model-based feedback mechanisms, scalable distributed control approaches for these types of applications, and scalability aspects of distributed computing.

 

Design, Optimization and Control of Self-driving and Autonomous Networked Systems
Haorui Peng (PhD student, Lund University), Maria Kihl (advisor, Lund University), Martina Maggio (co-advisor, Lund University), Karl-Erik Årzén (co-advisor, Lund University)

The research will focus on algorithms and theory for optimizing autonomous networked systems, which are envisioned as the essential infrastructure for mission-critical services in IoT and mobile edge cloud scenarios. The network will typically include wireless access based on 5G technologies with wired backhaul and core.

The system aims to provide the high-reliability, ultra-low-latency connectivity required by mission-critical services and IoT applications (e.g., smart cities, autonomous vehicles, and cloud robotics). However, the latency is highly affected by end-to-end physical distances, network congestion, the access capability of the system, etc. The key research topics will therefore be resource management and decentralized control in order to meet the latency and resource requirements.

 

STAMINA: Processing and analysis of data STreams in Advanced Metering INfrastructures for Awareness and Adaptiveness in electricity grids
Joris van Rooij (Industrial PhD, Göteborg Energi, Göteborg), Peter Berggren (advisor, Göteborg Energi), Marina Papatriantafilou (advisor, Chalmers), Vincenzo Gulisano (co-advisor, Chalmers)

Sweden was one of the first countries to introduce the Advanced Metering Infrastructure (AMI) in electricity networks by deploying smart meters in 2009. The introduction of this infrastructure implies the existence of large volumes of data which, with appropriate processing, interpretation, and use, can enable functionality related to resource planning, adaptiveness, and fault tolerance in electricity networks. For example, it is envisioned that with the right information at hand it will be possible to (i) raise early warnings and react in emergency situations, (ii) aggregate power where and when needed, (iii) help customers and components make scheduling decisions for improved use of electric power, and more.

The goal of the scientific investigation is to develop methods and algorithms that enable the utility company to make better use of the infrastructure, as well as to develop knowledge for better understanding future electricity networks. We will study parallel and distributed computation methods to calculate real-time or close to real-time estimates of important parameters, such as distribution losses.
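
As a minimal sketch of such a streaming computation, the code below estimates per-window distribution losses in a grid segment as the difference between the energy fed in and the sum of smart-meter readings; the meter identifiers, window length, and data are fabricated for illustration.

    from collections import defaultdict

    def window_losses(readings, window_s=900):
        """Per-window loss estimate from a stream of (time, meter_id, kWh) tuples.

        Readings from the "feeder" meter measure energy fed into the segment;
        all other meters measure consumption. Loss = fed-in minus consumed.
        """
        fed_in, consumed = defaultdict(float), defaultdict(float)
        for t, meter, kwh in readings:
            w = int(t // window_s)
            (fed_in if meter == "feeder" else consumed)[w] += kwh
        return {w: fed_in[w] - consumed[w] for w in fed_in}

    # Synthetic 15-minute window: feeder injects 120 kWh, meters report 112 kWh.
    stream = [(10, "feeder", 120.0), (200, "meter-1", 60.0), (300, "meter-2", 52.0)]
    print(window_losses(stream))   # {0: 8.0} -> roughly 8 kWh lost in the segment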

 

Autonomous network resource management for disaggregated DCs
Amir Roozbeh (Industrial PhD student, KTH, Ericsson), Fetahi Wuhib (advisor, Ericsson), Dejan Kostic (advisor, KTH), Gerald Q. Maguire Jr. (co-advisor, KTH)

Software-Defined “Hardware” Infrastructures (SDHI) break the physical server-oriented model via hardware disaggregation. Disaggregation means that each type of resource in the DC is seen as a resource pool, and the DC's fast interconnect network directly interconnects all of these resources. This new approach brings greater modularity, flexibility, and extensibility to DC infrastructures, allowing operators to employ resources more efficiently. Networking plays a pivotal role in an SDHI architecture, being both an enabler and a potential blocking factor. Until recently, networking technologies could not overcome the need to place the different components in close physical proximity, due to strict latency and bandwidth requirements for interconnecting them. In this project, we investigate the possibilities of realizing fast networking that can meet the requirements of SDHI through smart and autonomous management of resources.

 

Event-based Information Fusion for the Self-Adaptive Cloud
Johan Ruuskanen (PhD student, Lund University), Anton Cervin (advisor, Lund University), Karl-Erik Årzén (co-advisor, Lund University)

The idea of the self-adaptive cloud is to handle workload variations and structural changes by regulating the resources provided to the cloud service. The goal is to provide just the right amount of computing resources at all times, so that the cost is minimized while still maintaining good performance. This can be viewed as a classical feedback control loop, where the cloud service is the plant under control and the adaptation mechanism is the controller.
However, measurements in cloud systems are not continuous signals but rather discrete events. New information is available only when something happens in the system, for instance when a new customer arrives or when a request is completed. Further, successful control often relies on accurate tracking of the system states, many of which are not directly measurable in real applications. Instead, the unknown states and parameters of the cloud system need to be estimated using a model of the system together with the various measurements.

This information fusion problem becomes challenging because of the non-linear behavior of the cloud service and because new information is only available at discrete events. The project thus aims at developing novel, event-based estimation techniques for information fusion in cloud server systems.
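
A minimal sketch of such event-based estimation is given below: the unobserved service rate of a FIFO server is estimated from the discrete arrival and departure events it emits; the event format and the synthetic trace are assumptions made for illustration.

    def estimate_service_rate(events):
        """Estimate a server's service rate from (time, kind, job_id) events.

        A job's service starts at its arrival or at the previous departure,
        whichever is later; the rate estimate is jobs served per unit busy time.
        """
        arrival, last_departure = {}, 0.0
        busy_time, served = 0.0, 0
        for t, kind, job in events:
            if kind == "arr":
                arrival[job] = t
            else:                               # departure event
                start = max(arrival.pop(job), last_departure)
                busy_time += t - start
                last_departure = t
                served += 1
        return served / busy_time if busy_time > 0 else None

    # Synthetic FIFO trace: three requests, each taking about 0.5 s of service.
    trace = [(0.0, "arr", 1), (0.1, "arr", 2), (0.5, "dep", 1),
             (1.0, "dep", 2), (1.2, "arr", 3), (1.7, "dep", 3)]
    print(estimate_service_rate(trace))   # about 2 jobs per second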

 

Software Developer Targeted Performance and Cost Management for Autonomic Cloud Applications
Joel Scheuner (PhD student, Chalmers | University of Gothenburg), Philipp Leitner (co-advisor, Chalmers | University of Gothenburg), Ivica Crnkovic (advisor, Chalmers | University of Gothenburg)

In this project, we will research methods and mechanisms that enable software developers to proactively manage the quality of cloud-based services. We will concentrate on two quality aspects: software performance (e.g., the management and prediction of non-functional service properties, such as response time or scalability) and deployment costs. These quality aspects are in practice highly linked: poor software performance is often at the root of quality issues in cloud services, and the current state of practice for dealing with cloud performance issues is to overprovision, leading to monetary overspending that could be avoided with higher software quality or better predictions.
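
As a minimal sketch of the kind of developer-facing decision support this implies, the code below picks the cheapest deployment (instance type and count) whose predicted response time, under an assumed M/M/1 model per instance, stays below a target; the instance catalogue, prices, and rates are invented for illustration.

    def cheapest_deployment(arrival_rate, target_resp, instance_types, max_n=20):
        """Pick the cheapest (type, count, cost) meeting the response-time target."""
        best = None
        for name, (service_rate, hourly_cost) in instance_types.items():
            for n in range(1, max_n + 1):
                per_instance = arrival_rate / n     # assume load is balanced evenly
                if per_instance >= service_rate:
                    continue                        # overloaded, try more instances
                resp = 1.0 / (service_rate - per_instance)
                if resp <= target_resp:
                    cost = n * hourly_cost
                    if best is None or cost < best[2]:
                        best = (name, n, cost)
                    break                           # more instances only cost more
        return best

    # Hypothetical catalogue: (requests/s per instance, price per hour).
    catalogue = {"small": (80.0, 0.05), "large": (300.0, 0.17)}
    print(cheapest_deployment(500.0, 0.05, catalogue))   # two 'large' instances win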

 

Mission Critical Cloud
Per Skarin (Industrial PhD student, Lund University, Ericsson), Johan Eker (advisor, Ericsson), Karl-Erik Årzén (advisor, Lund University), Maria Kihl (co-advisor, Lund University), Martina Maggio (co-advisor, Lund University)

Cloud technology has swiftly transformed the ICT industry and is continuing to spread. Many ICT applications are suitable for cloud deployment in that they have relaxed timing or performance requirements. In order to take the cloud concepts beyond the ICT domain and apply them to mission-critical use cases such as industrial automation, transport, and health care, we must provide guarantees and predictability. To this end we need new tools and new ways of working. This project attacks the problem from two angles. First, we will work on developing a cloud infrastructure with deterministic behaviour, thereby suitable for critical applications; zero-touch configuration of the cloud based on feedback is a fundamental building block in our approach. Second, we will showcase the viability of the hardened cloud through mission-critical cloud applications running in a real data center and operating real-world processes, e.g., robotics and unmanned vehicles.

 

Anomaly detection in future RAN
Tobias Sundqvist (Industrial PhD, Tieto Umeå), Erik Elmroth (advisor, Umeå University), Monowar H Bhuyan (co-advisor, Umeå University), Johan Forsman (co-advisor, Tieto)

The future 5G Radio Access Network (RAN) will become a very complex system due to the new architecture in which parts are virtualized and distributed over multiple data centers. In order to help troubleshooters understand the system and shorten the time to find faults, we will collect a set of metrics from the future RAN and use machine learning techniques to detect and diagnose faults and anomalies.

This research work will investigate different issues related to anomaly detection in the future RAN. For example: which metrics are needed and when should they be collected, which machine learning techniques can be used to detect anomalies, what kinds of anomalies are possible to detect, which anomalies are easy or hard for a machine versus a user to detect, how can we find the root cause of the anomalies, and how can we visualize the RAN system so that it becomes easier for a user to troubleshoot? Since there is no live 5G network available for real-time experiments, we will use the facilities available at Tieto in Umeå.
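
As a minimal sketch of one of the simplest candidate techniques, the code below flags metric samples whose rolling z-score exceeds a threshold; the metric, window length, threshold, and synthetic trace are illustrative assumptions, and the project itself will evaluate which techniques actually suit RAN data.

    from collections import deque
    from statistics import mean, stdev

    def zscore_anomalies(samples, window=30, threshold=3.0):
        """Flag samples deviating more than `threshold` std devs from a rolling window."""
        history, flagged = deque(maxlen=window), []
        for i, x in enumerate(samples):
            if len(history) >= window:
                mu, sigma = mean(history), stdev(history)
                if sigma > 0 and abs(x - mu) / sigma > threshold:
                    flagged.append((i, x))
            history.append(x)
        return flagged

    # Synthetic latency trace (ms) with an obvious spike at index 40.
    trace = [5.0 + 0.1 * (i % 7) for i in range(60)]
    trace[40] = 25.0
    print(zscore_anomalies(trace))   # [(40, 25.0)]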

 

Activities

Besides short cluster meetings held, e.g., at the WASP Winter Conference, the cluster frequently meets at longer scientific meetings. An initial cluster meeting was held at the 8th Cloud Control Workshop, organized in Lövånger, Feb 1-3, 2016. After that, cluster meetings have been held in conjunction with the 9th Cloud Control Workshop, June 27-29, 2016, at Friiberghs Herrgård by Lake Mälaren; the 10th Cloud Control Workshop, Feb 27 – Mar 1, 2017, at Strömbäcks Folkhögskola, Umeå; the 11th Cloud Control Workshop, June 12-14, 2017, at Haga Slott by Lake Mälaren; the 12th Cloud Control Workshop, Mar 19-21, 2018, in Lövånger; and the 13th Cloud Control Workshop, June 13-15, 2018, at Skåvsjöholm in the Stockholm archipelago.