Merged
37 changes: 30 additions & 7 deletions docs/.nav.yml
100644 → 100755
@@ -5,14 +5,37 @@ nav:
- Quickstart: quickstart.md
- Installation: installation.md
- Design:
- design/*.md
- MLConnector:
- mlconnector/Overview.md
- mlconnector/Installation.md
- mlconnector/Step-by-step guide.md
- Architecture: design/architecture.md
- Agents:
- Multi-layer Agents: design/agents.md
- MAPE Tasks: design/mape.md
- SPADE: design/spade.md
- Controllers: design/controllers.md
- Plugins:
- Plugin System: design/plugins/plugin_system.md
- Policy Plugins: design/plugins/policy_plugins.md
- Mechanism Plugins: design/plugins/mechanism_plugins.md
- Descriptions:
- Application Description: design/application-description.md
- System Description: design/system-description.md
- Telemetry: design/telemetry.md
- Agent Configuration: design/agent-configuration.md
- ML Connector: design/ml-connector.md
- User Guide:
- Application Description:
- System Description:
- Policy Plugins:
- Mechanism Plugins:
- MLConnector:
- mlconnector/Overview.md
- mlconnector/Installation.md
- mlconnector/Step-by-step guide.md
- Developer Guide:
- developer-guide/*.md
- Tutorials:
- tutorials/*.md


- References:
- Python Telemetry API Reference: references/telemetrysdk.md
- Northbound API Reference: references/northbound-api.md
- ML Connector API Reference: references/ml-connector.md
- Command-line Interfaces: references/cli.md
Empty file modified docs/CNAME
100644 → 100755
Empty file.
Binary file added docs/assets/img/EN-Funded.png
Binary file added docs/assets/img/agent_blocks.png
Binary file added docs/assets/img/agent_high.png
Binary file added docs/assets/img/app_description_sequence.png
Binary file added docs/assets/img/arch.png
Binary file added docs/assets/img/cluster_telemetry.png
1 change: 1 addition & 0 deletions docs/assets/img/concept.svg
Binary file added docs/assets/img/cont_telemetry.png
Binary file added docs/assets/img/hb_messages.png
Binary file added docs/assets/img/high_level_arch.png
Empty file modified docs/assets/img/mlsysops-logo.png
100644 → 100755
Binary file added docs/assets/img/mlsysops_logo700x280.png
Binary file added docs/assets/img/node_telemetry.png
Binary file added docs/assets/img/otel_deploy_sequence.png
Binary file added docs/assets/img/plugin_exec_flow.png
Binary file added docs/assets/img/system_description_sequence.png
Binary file added docs/assets/img/telemetry_high.jpg
Binary file added docs/assets/img/telemetry_pipeline.png
Empty file modified docs/assets/javascripts/console-copy.js
100644 → 100755
Empty file.
Empty file modified docs/assets/stylesheets/theme.css
100644 → 100755
Empty file.
32 changes: 32 additions & 0 deletions docs/design/agent-configuration.md
@@ -0,0 +1,32 @@
Each agent uses a configuration file that defines its behaviour during instantiation. While agents operating at
different layers of the continuum instantiate different components of the core MLSysOps framework, all agents running on
nodes use the same base instance. However, since node characteristics may vary significantly, each agent can be
individually configured using its corresponding configuration file.

```yaml
telemetry:
default_metrics:
- "node_load1"
monitor_data_retention_time: 30
monitor_interval: 10s
managed_telemetry:
enabled: True

policy_plugins:
directory: "policies"

mechanism_plugins:
directory: "mechanisms"
enabled_plugins:
- "CPUFrequencyConfigurator"

continuum_layer: "node"

system_description: 'descriptions/rpi5-1.yaml'

behaviours:
APIPingBehaviour:
enabled: False
Subscribe:
enabled: False
```
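At startup the agent parses this file (e.g. with `yaml.safe_load`) and merges it with built-in defaults so that optional fields can be omitted. A minimal sketch of such a merge helper is shown below; the default values and the helper itself are illustrative assumptions, not the framework's actual loader.

```python
# Hypothetical defaults for the agent configuration; the values shown
# here are assumptions for illustration, mirroring the sample file.
DEFAULTS = {
    "continuum_layer": "node",
    "telemetry": {"monitor_interval": "10s", "monitor_data_retention_time": 30},
}

def apply_defaults(config: dict, defaults: dict = DEFAULTS) -> dict:
    """Overlay a parsed configuration dict onto the defaults."""
    merged = dict(defaults)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge one level deep so partial sections keep their defaults.
            merged[key] = {**merged[key], **value}
        else:
            merged[key] = value
    return merged

# A file that only overrides the monitor interval keeps the default
# retention time and continuum layer.
config = apply_defaults({"telemetry": {"monitor_interval": "5s"}})
```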
36 changes: 36 additions & 0 deletions docs/design/agents.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
The agent component forms the core of the MLSysOps framework. It provides essential integration logic across all layers,
connecting the configuration mechanisms of the underlying system, telemetry data collected from various system
entities (e.g., application, infrastructure), and system configuration policies. Figure 32 illustrates the high-level
architectural structure of the agent. The component exposes two interfaces—the Northbound and Southbound APIs—which
offer structured methods for different system users to interact with it. The Northbound API targets application and
policy developers, whereas the Southbound API is primarily intended for system administrators and mechanism providers.


<img src="../../assets/img/agent_high.png" width="600" style="margin:auto; display:block;"/>

The agent follows the MAPE (Monitor-Analyze-Plan-Execute) paradigm, proposed in 2003 [55] for managing autonomic
systems given high-level objectives from system administrators, applying the same notion to its main configuration
tasks, depicted as MAPE Tasks in Figure 32. The agent is implemented in Python and leverages the SPADE multi-agent
framework [56] to form a network of agents that communicate through the XMPP protocol using a set of defined messages;
the required functionality is provided by internal tasks called behaviours. To achieve seamless operation between the
various sub-modules, the agent implements a set of controllers that are responsible for managing the various external
and internal interactions.
One important design goal of the agent was extensibility. This goal is achieved by defining simple yet powerful
abstractions for two important actors interacting with the system: on one side, the policy developer, who implements the
core management logic, and on the other side, the mechanism provider, who exposes the available configuration options
for a subsystem. Both abstractions are integrated into the MLSysOps agent as plugin functionalities, specifically named
policy and mechanism plugins. The agent's analysis, planning, and execution tasks depend on this plugin system to
generate intelligent configuration decisions—provided by the installed policy plugins—and to apply those decisions to
the underlying system via the available mechanism plugins.
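The policy-plugin abstraction can be illustrated with a minimal sketch. The hook names (`initialize`, `analyze`, `plan`) and the dict-based telemetry and plan formats below are assumptions for illustration, not the framework's actual plugin API:

```python
# Hypothetical policy plugin: hook names and data shapes are
# illustrative assumptions, not the real MLSysOps plugin interface.
def initialize() -> dict:
    # Declare the telemetry metrics this policy wants monitored.
    return {"metrics": ["node_load1"]}

def analyze(telemetry: dict) -> bool:
    # Decide whether any reconfiguration is needed at all.
    return telemetry.get("node_load1", 0.0) > 0.8

def plan(telemetry: dict) -> dict:
    # Produce a plan that a mechanism plugin (here the
    # "CPUFrequencyConfigurator" from the sample configuration)
    # could then execute.
    return {
        "mechanism": "CPUFrequencyConfigurator",
        "action": "set_frequency",
        "level": "high" if telemetry["node_load1"] > 1.5 else "medium",
    }
```

In this sketch the analysis task calls `analyze` with current telemetry, and only invokes `plan` when it returns `True`, keeping decision logic cleanly separated from the mechanisms that enact it.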

<img src="../../assets/img/agent_blocks.png" width="600" style="margin:auto; display:block;"/>

The agent software is structured into different module types:

- Core Module – Provides foundational functionalities shared by all agent instances (continuum, cluster, and node).
- Layer-Specific Modules – Offer customized implementations specific to the roles of continuum, cluster, or node agents.
- External Interface Modules – Facilitate interactions between the agent framework and external entities. These modules
include the CLI, Northbound API, ML Connector, policy and mechanism plugins.

This modular architecture ensures consistency in core functionalities across all agents, while also supporting
customization and extension for specific layers and external interactions.
14 changes: 14 additions & 0 deletions docs/design/application-description.md
@@ -0,0 +1,14 @@
The application owner, one of the main actors, interacts with MLSysOps by submitting the application description using
the Command Line Interface (CLI) provided by the framework. The application description captures the required
deployment constraints (e.g., node type, hardware, sensor requirements), which enable various filtering options at the
continuum and cluster layers, allowing them to select the candidate clusters and nodes, respectively. Taking the
registration of a given application as an example, as shown in Figure 42, the framework performs a top-down propagation
of the necessary information to each layer of the continuum. Initially, the Continuum agent creates a Kubernetes Custom
Resource that is
propagated to the available Kubernetes clusters. The Cluster agents follow the Kubernetes Operator pattern, so they are
notified of application creation, update, or removal events. Each Cluster agent manages the components that match its
cluster ID, if any. This information is provided by the Continuum agent in the application's Custom Resource. A given
Cluster agent captures the application creation event, parses the description, and deploys the components based on the
provided requirements. The component specifications are also sent to their host nodes, so that the Node agents can store
relevant fields required for any potential reconfiguration/adaptation.

![app_description_sequence.png](../assets/img/app_description_sequence.png)
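To make the filtering step concrete, a description might constrain a component to a class of nodes. The field names in this sketch are assumptions for illustration only, not the actual MLSysOps application description schema:

```yaml
# Illustrative sketch only: field names are assumptions, not the
# actual application description schema.
name: sample-app
components:
  - name: inference-service
    cluster_id: cluster-a        # matched by the owning Cluster agent
    node_type: edge              # example deployment constraints used
    hardware:                    # for continuum/cluster filtering
      accelerator: gpu
    sensors:
      - camera
```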
79 changes: 79 additions & 0 deletions docs/design/architecture.md
@@ -0,0 +1,79 @@
# Architecture

MLSysOps introduces a hierarchical agent-based architecture composed of three levels:
- Node Agents reside on individual nodes and expose configuration interfaces, monitor resource usage, and provide direct
access to telemetry.
- Cluster Agents coordinate groups of nodes, aggregate telemetry, and issue deployment decisions or adaptation
instructions.
- The Continuum Agent sits at the top level, interfacing with external stakeholders (via northbound APIs), receiving
high-level intents and application descriptors, and coordinating decision-making across slices.

Each layer operates a Monitor–Analyze–Plan–Execute (MAPE) control loop, enabling autonomous adaptation based on local
and global telemetry, system optimization targets, and ML-driven policies. Importantly, this architecture separates
management logic from resource control, allowing for modular evolution and system introspection.
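The separation of the four MAPE phases can be sketched as a single loop iteration; the callables and data shapes here are hypothetical stand-ins, not the framework's implementation:

```python
# Minimal sketch of one MAPE iteration over pluggable phases.
# The phase callables and the telemetry/action shapes are
# illustrative assumptions.
def mape_step(monitor, analyze, plan, execute):
    telemetry = monitor()          # Monitor: collect metrics
    if analyze(telemetry):         # Analyze: is adaptation needed?
        actions = plan(telemetry)  # Plan: decide what to change
        execute(actions)           # Execute: apply via mechanisms
        return actions
    return None

# Example run with stub phases standing in for real telemetry
# collection and mechanism plugins:
result = mape_step(
    monitor=lambda: {"node_load1": 2.0},
    analyze=lambda t: t["node_load1"] > 1.0,
    plan=lambda t: [("CPUFrequencyConfigurator", "boost")],
    execute=lambda actions: None,
)
```

Because each phase is swappable, the same loop skeleton serves node, cluster, and continuum agents with layer-specific monitors and planners.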

The MLSysOps agents, supported by ML models, analyse, predict, and optimize resource usage patterns and overall system
performance by allocating, monitoring and configuring the different resources of the underlying layers via the
mechanisms that are implemented in the context of WP3 and manifested in the current deliverable. This integration is a
collaborative effort that draws on the diverse expertise of project partners, each contributing unique insights and
solutions to the multifaceted challenges of cloud and edge computing. This collaborative approach is complemented by an
iterative development process characterized by continuous testing and feedback loops. Such a process ensures that the
mechanisms developed are not only effective in their current context but are also scalable and adaptable to future
technological advancements and operational needs.

<img src="../../assets/img/arch.png" width="600"/>

The following figure depicts a comprehensive illustration of the MLSysOps hierarchical agent system's placement
and its interactions with two other fundamental subsystems: container orchestration and telemetry. This agent hierarchy
is structured in line with the orchestration architecture, and it is logically divided into three tiers. The
communication among the three subsystems (agents, container orchestration, and telemetry) is facilitated through
designated interfaces at each tier. Moreover, the agent system engages with the continuum level's system agents and
integrates plug-in configuration policies that can use ML models at all levels. At every level, agents utilize mechanism
plugins to implement commands for adjusting available configuration and execution mode options.

<img src="../../assets/img/high_level_arch.png" width="600"/>

Node-level agents interface with local telemetry systems and expose configuration knobs. Cluster-level agents
coordinate resource allocation decisions across groups of nodes. At the top level, the continuum agent handles global
orchestration, provides APIs to external actors, and aggregates telemetry data. ML-driven decisions can be made at every
layer, using the information available at that layer. This layered approach facilitates scalability and separation of
concerns while supporting collaboration across orchestration, telemetry, and ML systems. The agent infrastructure
interacts through three distinct types of interfaces. The Northbound API provides access to application developers and
system administrators. The Southbound API interfaces with the underlying telemetry collection and configuration
mechanisms. The ML Connector allows ML models to be plugged into the framework and invoked for training, prediction, and
explanation tasks.

The telemetry subsystem is built upon the OpenTelemetry specification and is responsible for collecting and processing
metrics, logs, and traces. These are abstracted into hierarchical telemetry streams that feed the decision logic of the
agents and the ML models. Data collection happens at the node level, where individual collectors expose metrics either
in raw or aggregated formats. These are processed through transformation pipelines and propagated to cluster and
continuum levels for higher-level aggregation and analysis.
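The roll-up from node-level collectors to a cluster-level view can be sketched with a simple aggregation step; the sample shapes and names are illustrative assumptions, not the OpenTelemetry pipeline configuration itself:

```python
# Sketch of hierarchical aggregation: per-node samples of one metric
# (e.g. node_load1) rolled up into a cluster-level summary. Data
# shapes are illustrative assumptions.
from statistics import mean

def aggregate_cluster(node_samples: dict) -> dict:
    """node_samples maps node id -> recent samples of one metric."""
    return {
        node: {"avg": mean(samples), "max": max(samples)}
        for node, samples in node_samples.items()
        if samples  # skip nodes with no data yet
    }

cluster_view = aggregate_cluster({
    "rpi5-1": [0.4, 0.6],
    "rpi5-2": [1.2, 1.0],
})
```

The same pattern repeats one level up, with cluster summaries feeding the continuum-level analysis.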

Application deployment and orchestration are driven by declarative descriptions submitted by application developers and
administrators. These descriptions capture the application's structure, resource requirements, and quality-of-service
objectives. Deployment is handled through standard container orchestration tools, which are extended by the MLSysOps
framework to support advanced placement decisions and runtime adaptation. For far-edge deployments, the framework
introduces a proxy-based architecture involving embServe on constrained devices and a virtual orchestrator service
running inside containerized environments. This approach allows resource-constrained devices to be seamlessly integrated
into the same orchestration and telemetry flows as more capable edge and cloud nodes.

The object storage infrastructure builds upon and extends SkyFlok, a secure and distributed storage system. In MLSysOps,
this infrastructure supports adaptive reconfiguration of bucket policies based on real-time telemetry and application
usage patterns. The storage system exposes telemetry data regarding latency, bandwidth, and access frequency, enabling
agents and ML models to optimize redundancy and placement decisions without disrupting ongoing operations.

The framework also includes specialized subsystems for anomaly detection and trust assessment. These modules analyze
telemetry data to identify attacks or malfunctions and classify anomalies using ML models. Their outputs are exposed
through the telemetry interface and used by higher-level agents to trigger remediation strategies or adapt orchestration
plans. Trust levels for nodes are computed using a combination of identity, behaviour, and capability metrics, forming a
reputation-based model that influences agent decision-making.

ML models play a central role in enabling the autonomic operation of the framework. Each level of the agent hierarchy
may employ one or more models, which are integrated via the ML Connector API. These models receive structured telemetry
input and produce configuration decisions, which are interpreted and enacted by the agents. The framework supports
reinforcement learning, continual learning, and federated learning scenarios. In addition, explainability mechanisms are
integrated into the ML workflows to allow system administrators and application developers to understand and audit the
decisions made by the models.

MLSysOps effectively manages operations by leveraging telemetry data collected from each level, which provides essential
insights. This data, combined with machine learning models, enhances the decision-making process, aligning with both the
application's objectives and the system's requirements. Actions based on these decisions are cascaded and refined from
the top level downwards. The final status and outcomes of these decisions are then made accessible to system Actors. The
design and functionality of the telemetry system are further explained in [Telemetry system design](telemetry.md).
21 changes: 21 additions & 0 deletions docs/design/controllers.md
@@ -0,0 +1,21 @@
Controllers are responsible for coordinating all internal components of the framework, including the MAPE tasks, SPADE,
Policy and Mechanism Plugins, and the Northbound and Southbound APIs.

- **Application Controller**: Manages the lifecycle of the Analyze loop for each application submitted to the system. When a
new application is submitted, a corresponding Analyze behaviour is initiated, and it is terminated when the application
is removed.

- **Policy & Mechanism Plugin Controllers**: Responsible for loading, initializing, and configuring policy and mechanism
plugins. During runtime, these controllers provide updated information to the Application Controller, reflecting any
changes in the policy API files.

- **Agent Configuration Controller**: Handles external configuration commands received from other agents or via the Northbound
API, and propagates them to the appropriate internal components. It is also responsible for loading the initial
configuration file during startup.

- **Telemetry Controller**: Manages the OpenTelemetry Collector for each agent, including initial deployment and runtime
configuration. Since each collector operates as a pod within the cluster, the Node Agent coordinates with the Cluster
Agent to request deployment and updates, as depicted in Figure 41. Additionally, this controller configures the Monitor
task based on the telemetry metrics being collected.

![otel_deploy_sequence.png](../assets/img/otel_deploy_sequence.png)
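The per-application lifecycle managed by the Application Controller can be sketched with `asyncio` tasks standing in for SPADE behaviours; the class and method names are illustrative assumptions, not the framework's implementation:

```python
# Sketch of an Application Controller that starts one Analyze loop per
# submitted application and cancels it on removal. Names and structure
# are illustrative assumptions.
import asyncio

class ApplicationController:
    def __init__(self):
        self._loops = {}  # app id -> running Analyze task

    async def _analyze_loop(self, app_id: str):
        while True:                    # one Analyze behaviour per app
            await asyncio.sleep(0.01)  # placeholder for analysis work

    def on_app_submitted(self, app_id: str):
        # Initiate the corresponding Analyze behaviour.
        self._loops[app_id] = asyncio.ensure_future(self._analyze_loop(app_id))

    def on_app_removed(self, app_id: str):
        # Terminate the behaviour when the application is removed.
        self._loops.pop(app_id).cancel()

async def demo():
    ctrl = ApplicationController()
    ctrl.on_app_submitted("app-1")
    await asyncio.sleep(0.05)
    ctrl.on_app_removed("app-1")
    return list(ctrl._loops)

remaining = asyncio.run(demo())  # no loops left after removal
```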
Empty file modified docs/design/index.md
100644 → 100755
Empty file.