Merged
37 changes: 30 additions & 7 deletions docs/.nav.yml
100644 → 100755
@@ -5,14 +5,37 @@ nav:
- Quickstart: quickstart.md
- Installation: installation.md
- Design:
- design/*.md
- MLConnector:
- mlconnector/Overview.md
- mlconnector/Installation.md
- mlconnector/Step-by-step guide.md
- Architecture: design/architecture.md
- Agents:
- Multi-layer Agents: design/agents.md
- MAPE Tasks: design/mape.md
- SPADE: design/spade.md
- Controllers: design/controllers.md
- Plugins:
- Plugin System: design/plugins/plugin_system.md
- Policy Plugins: design/plugins/policy_plugins.md
- Mechanism Plugins: design/plugins/mechanism_plugins.md
- Descriptions:
- Application Description: design/application-description.md
- System Description: design/system-description.md
- Telemetry: design/telemetry.md
- Agent Configuration: design/agent-configuration.md
- ML Connector: design/ml-connector.md
- User Guide:
- Application Description:
- System Description:
- Policy Plugins:
- Mechanism Plugins:
- MLConnector:
- mlconnector/Overview.md
- mlconnector/Installation.md
- mlconnector/Step-by-step guide.md
- Developer Guide:
- developer-guide/*.md
- Tutorials:
- tutorials/*.md


- References:
- Python Telemetry API Reference: references/telemetrysdk.md
- Northbound API Reference: references/northbound-api.md
- ML Connector API Reference: references/ml-connector.md
- Command-line Interfaces: references/cli.md
Empty file modified docs/CNAME
100644 → 100755
Empty file.
Binary file added docs/assets/img/EN-Funded.png
Binary file added docs/assets/img/agent_blocks.png
Binary file added docs/assets/img/agent_high.png
Binary file added docs/assets/img/app_description_sequence.png
Binary file added docs/assets/img/arch.png
Binary file added docs/assets/img/cluster_telemetry.png
1 change: 1 addition & 0 deletions docs/assets/img/concept.svg
Binary file added docs/assets/img/cont_telemetry.png
Binary file added docs/assets/img/hb_messages.png
Binary file added docs/assets/img/high_level_arch.png
Empty file modified docs/assets/img/mlsysops-logo.png
100644 → 100755
Binary file added docs/assets/img/mlsysops_logo700x280.png
Binary file added docs/assets/img/node_telemetry.png
Binary file added docs/assets/img/otel_deploy_sequence.png
Binary file added docs/assets/img/plugin_exec_flow.png
Binary file added docs/assets/img/system_description_sequence.png
Binary file added docs/assets/img/telemetry_high.jpg
Binary file added docs/assets/img/telemetry_pipeline.png
Empty file modified docs/assets/javascripts/console-copy.js
100644 → 100755
Empty file.
Empty file modified docs/assets/stylesheets/theme.css
100644 → 100755
Empty file.
32 changes: 32 additions & 0 deletions docs/design/agent-configuration.md
@@ -0,0 +1,32 @@
Each agent uses a configuration file that defines its behaviour during instantiation. While agents operating at
different layers of the continuum instantiate different components of the core MLSysOps framework, all agents running on
nodes use the same base instance. However, since node characteristics may vary significantly, each agent can be
individually configured using its corresponding configuration file.

```yaml
telemetry:
default_metrics:
- "node_load1"
monitor_data_retention_time: 30
monitor_interval: 10s
managed_telemetry:
enabled: True

policy_plugins:
directory: "policies"

mechanism_plugins:
directory: "mechanisms"
enabled_plugins:
- "CPUFrequencyConfigurator"

continuum_layer: "node"

system_description: 'descriptions/rpi5-1.yaml'

behaviours:
APIPingBehaviour:
enabled: False
Subscribe:
enabled: False
```
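At startup the agent parses this file (e.g. with `yaml.safe_load`) and merges it with built-in defaults so that optional fields can be omitted. A minimal sketch of such a merge helper is shown below; the default values and the helper itself are illustrative assumptions, not the framework's actual loader.

```python
# Hypothetical defaults for the agent configuration; the values shown
# here are assumptions for illustration, mirroring the sample file.
DEFAULTS = {
    "continuum_layer": "node",
    "telemetry": {"monitor_interval": "10s", "monitor_data_retention_time": 30},
}

def apply_defaults(config: dict, defaults: dict = DEFAULTS) -> dict:
    """Overlay a parsed configuration dict onto the defaults."""
    merged = dict(defaults)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge one level deep so partial sections keep their defaults.
            merged[key] = {**merged[key], **value}
        else:
            merged[key] = value
    return merged

# A file that only overrides the monitor interval keeps the default
# retention time and continuum layer.
config = apply_defaults({"telemetry": {"monitor_interval": "5s"}})
```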
36 changes: 36 additions & 0 deletions docs/design/agents.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
The agent component forms the core of the MLSysOps framework. It provides essential integration logic across all layers,
connecting the configuration mechanisms of the underlying system, telemetry data collected from various system
entities (e.g., application, infrastructure), and system configuration policies. Figure 32 illustrates the high-level
architectural structure of the agent. The component exposes two interfaces—the Northbound and Southbound APIs—which
offer structured methods for different system users to interact with it. The Northbound API targets application and
policy developers, whereas the Southbound API is primarily intended for system administrators and mechanism providers.


<img src="../../assets/img/agent_high.png" width="600" style="margin:auto; display:block;"/>

The agent follows the MAPE (Monitor-Analyze-Plan-Execute) paradigm, proposed in 2003 [55] for managing autonomic
systems given high-level objectives from system administrators, applying the same notion to its main configuration
tasks, depicted as MAPE Tasks in Figure 32. The agent is implemented in Python and leverages the SPADE multi-agent
framework [56] to form a network of agents that communicate through the XMPP protocol using a set of defined messages;
the required functionality is provided by internal tasks called behaviours. To achieve seamless operation between the
various sub-modules, the agent implements a set of controllers that are responsible for managing the various external
and internal interactions.
One important design goal of the agent was extensibility. This goal is achieved by defining simple yet powerful
abstractions for two important actors interacting with the system: on one side, the policy developer, who implements the
core management logic, and on the other side, the mechanism provider, who exposes the available configuration options
for a subsystem. Both abstractions are integrated into the MLSysOps agent as plugin functionalities, specifically named
policy and mechanism plugins. The agent's analysis, planning, and execution tasks depend on this plugin system to
generate intelligent configuration decisions—provided by the installed policy plugins—and to apply those decisions to
the underlying system via the available mechanism plugins.
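The policy-plugin abstraction can be illustrated with a minimal sketch. The hook names (`initialize`, `analyze`, `plan`) and the dict-based telemetry and plan formats below are assumptions for illustration, not the framework's actual plugin API:

```python
# Hypothetical policy plugin: hook names and data shapes are
# illustrative assumptions, not the real MLSysOps plugin interface.
def initialize() -> dict:
    # Declare the telemetry metrics this policy wants monitored.
    return {"metrics": ["node_load1"]}

def analyze(telemetry: dict) -> bool:
    # Decide whether any reconfiguration is needed at all.
    return telemetry.get("node_load1", 0.0) > 0.8

def plan(telemetry: dict) -> dict:
    # Produce a plan that a mechanism plugin (here the
    # "CPUFrequencyConfigurator" from the sample configuration)
    # could then execute.
    return {
        "mechanism": "CPUFrequencyConfigurator",
        "action": "set_frequency",
        "level": "high" if telemetry["node_load1"] > 1.5 else "medium",
    }
```

In this sketch the analysis task calls `analyze` with current telemetry, and only invokes `plan` when it returns `True`, keeping decision logic cleanly separated from the mechanisms that enact it.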

<img src="../../assets/img/agent_blocks.png" width="600" style="margin:auto; display:block;"/>

The agent software is structured into different module types:

- Core Module – Provides foundational functionalities shared by all agent instances (continuum, cluster, and node).
- Layer-Specific Modules – Offer customized implementations specific to the roles of continuum, cluster, or node agents.
- External Interface Modules – Facilitate interactions between the agent framework and external entities. These modules
include the CLI, Northbound API, ML Connector, policy and mechanism plugins.

This modular architecture ensures consistency in core functionalities across all agents, while also supporting
customization and extension for specific layers and external interactions.
14 changes: 14 additions & 0 deletions docs/design/application-description.md
@@ -0,0 +1,14 @@
The application owner, one of the main actors, interacts with MLSysOps by submitting the application description using
the Command Line Interface (CLI) provided by the framework. The application description captures the required
deployment constraints (e.g., node type, hardware, sensor requirements), which enable various filtering options at the
continuum and cluster layers, allowing them to select the candidate clusters and nodes, respectively. Taking the
registration of a given application as an example, as shown in Figure 42, the framework performs a top-down propagation
of the necessary information to each layer of the continuum. Initially, the Continuum agent creates a Kubernetes Custom
Resource that is
propagated to the available Kubernetes clusters. The Cluster agents follow the Kubernetes Operator pattern, so they are
notified of application creation, update, or removal events. Each Cluster agent manages the components that match its
cluster ID, if any. This information is provided by the Continuum agent in the application's Custom Resource. A given
Cluster agent captures the application creation event, parses the description, and deploys the components based on the
provided requirements. The component specifications are also sent to their host nodes, so that the Node agents can store
relevant fields required for any potential reconfiguration/adaptation.

![app_description_sequence.png](../assets/img/app_description_sequence.png)
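To make the filtering step concrete, a description might constrain a component to a class of nodes. The field names in this sketch are assumptions for illustration only, not the actual MLSysOps application description schema:

```yaml
# Illustrative sketch only: field names are assumptions, not the
# actual application description schema.
name: sample-app
components:
  - name: inference-service
    cluster_id: cluster-a        # matched by the owning Cluster agent
    node_type: edge              # example deployment constraints used
    hardware:                    # for continuum/cluster filtering
      accelerator: gpu
    sensors:
      - camera
```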
79 changes: 79 additions & 0 deletions docs/design/architecture.md
@@ -0,0 +1,79 @@
# Architecture

MLSysOps introduces a hierarchical agent-based architecture composed of three levels:
- Node Agents reside on individual nodes and expose configuration interfaces, monitor resource usage, and provide direct
access to telemetry.
- Cluster Agents coordinate groups of nodes, aggregate telemetry, and issue deployment decisions or adaptation
instructions.
- The Continuum Agent sits at the top level, interfacing with external stakeholders (via northbound APIs), receiving
high-level intents and application descriptors, and coordinating decision-making across slices.

Each layer operates a Monitor–Analyze–Plan–Execute (MAPE) control loop, enabling autonomous adaptation based on local
and global telemetry, system optimization targets, and ML-driven policies. Importantly, this architecture separates
management logic from resource control, allowing for modular evolution and system introspection.
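The separation of the four MAPE phases can be sketched as a single loop iteration; the callables and data shapes here are hypothetical stand-ins, not the framework's implementation:

```python
# Minimal sketch of one MAPE iteration over pluggable phases.
# The phase callables and the telemetry/action shapes are
# illustrative assumptions.
def mape_step(monitor, analyze, plan, execute):
    telemetry = monitor()          # Monitor: collect metrics
    if analyze(telemetry):         # Analyze: is adaptation needed?
        actions = plan(telemetry)  # Plan: decide what to change
        execute(actions)           # Execute: apply via mechanisms
        return actions
    return None

# Example run with stub phases standing in for real telemetry
# collection and mechanism plugins:
result = mape_step(
    monitor=lambda: {"node_load1": 2.0},
    analyze=lambda t: t["node_load1"] > 1.0,
    plan=lambda t: [("CPUFrequencyConfigurator", "boost")],
    execute=lambda actions: None,
)
```

Because each phase is swappable, the same loop skeleton serves node, cluster, and continuum agents with layer-specific monitors and planners.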

The MLSysOps agents, supported by ML models, analyse, predict, and optimize resource usage patterns and overall system
performance by allocating, monitoring and configuring the different resources of the underlying layers via the
mechanisms that are implemented in the context of WP3 and manifested in the current deliverable. This integration is a
collaborative effort that draws on the diverse expertise of project partners, each contributing unique insights and
solutions to the multifaceted challenges of cloud and edge computing. This collaborative approach is complemented by an
iterative development process characterized by continuous testing and feedback loops. Such a process ensures that the
mechanisms developed are not only effective in their current context but are also scalable and adaptable to future
technological advancements and operational needs.

<img src="../../assets/img/arch.png" width="600"/>

The following figure depicts a comprehensive illustration of the MLSysOps hierarchical agent system's placement
and its interactions with two other fundamental subsystems: container orchestration and telemetry. This agent hierarchy
is structured in line with the orchestration architecture, and it is logically divided into three tiers. The
communication among the three subsystems (agents, container orchestration, and telemetry) is facilitated through
designated interfaces at each tier. Moreover, the agent system engages with the continuum level's system agents and
integrates plug-in configuration policies that can use ML models at all levels. At every level, agents utilize mechanism
plugins to implement commands for adjusting available configuration and execution mode options.

<img src="../../assets/img/high_level_arch.png" width="600"/>

Node-level agents interface with local telemetry systems and expose configuration knobs. Cluster-level agents
coordinate resource allocation decisions across groups of nodes. At the top level, the continuum agent handles global
orchestration, provides APIs to external actors, and aggregates telemetry data. ML-driven decisions can be made at every
layer, using the information available at that layer. This layered approach facilitates scalability and separation of
concerns while supporting collaboration across orchestration, telemetry, and ML systems. The agent infrastructure
interacts through three distinct types of interfaces. The Northbound API provides access to application developers and
system administrators. The Southbound API interfaces with the underlying telemetry collection and configuration
mechanisms. The ML Connector allows ML models to be plugged into the framework and invoked for training, prediction, and
explanation tasks.

The telemetry subsystem is built upon the OpenTelemetry specification and is responsible for collecting and processing
metrics, logs, and traces. These are abstracted into hierarchical telemetry streams that feed the decision logic of the
agents and the ML models. Data collection happens at the node level, where individual collectors expose metrics either
in raw or aggregated formats. These are processed through transformation pipelines and propagated to cluster and
continuum levels for higher-level aggregation and analysis.
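The roll-up from node-level collectors to a cluster-level view can be sketched with a simple aggregation step; the sample shapes and names are illustrative assumptions, not the OpenTelemetry pipeline configuration itself:

```python
# Sketch of hierarchical aggregation: per-node samples of one metric
# (e.g. node_load1) rolled up into a cluster-level summary. Data
# shapes are illustrative assumptions.
from statistics import mean

def aggregate_cluster(node_samples: dict) -> dict:
    """node_samples maps node id -> recent samples of one metric."""
    return {
        node: {"avg": mean(samples), "max": max(samples)}
        for node, samples in node_samples.items()
        if samples  # skip nodes with no data yet
    }

cluster_view = aggregate_cluster({
    "rpi5-1": [0.4, 0.6],
    "rpi5-2": [1.2, 1.0],
})
```

The same pattern repeats one level up, with cluster summaries feeding the continuum-level analysis.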

Application deployment and orchestration are driven by declarative descriptions submitted by application developers and
administrators. These descriptions capture the application's structure, resource requirements, and quality-of-service
objectives. Deployment is handled through standard container orchestration tools, which are extended by the MLSysOps
framework to support advanced placement decisions and runtime adaptation. For far-edge deployments, the framework
introduces a proxy-based architecture involving embServe on constrained devices and a virtual orchestrator service
running inside containerized environments. This approach allows resource-constrained devices to be seamlessly integrated
into the same orchestration and telemetry flows as more capable edge and cloud nodes.

The object storage infrastructure builds upon and extends SkyFlok, a secure and distributed storage system. In MLSysOps,
this infrastructure supports adaptive reconfiguration of bucket policies based on real-time telemetry and application
usage patterns. The storage system exposes telemetry data regarding latency, bandwidth, and access frequency, enabling
agents and ML models to optimize redundancy and placement decisions without disrupting ongoing operations.

The framework also includes specialized subsystems for anomaly detection and trust assessment. These modules analyze
telemetry data to identify attacks or malfunctions and classify anomalies using ML models. Their outputs are exposed
through the telemetry interface and used by higher-level agents to trigger remediation strategies or adapt orchestration
plans. Trust levels for nodes are computed using a combination of identity, behaviour, and capability metrics, forming a
reputation-based model that influences agent decision-making.

ML models play a central role in enabling the autonomic operation of the framework. Each level of the agent hierarchy
may employ one or more models, which are integrated via the ML Connector API. These models receive structured telemetry
input and produce configuration decisions, which are interpreted and enacted by the agents. The framework supports
reinforcement learning, continual learning, and federated learning scenarios. In addition, explainability mechanisms are
integrated into the ML workflows to allow system administrators and application developers to understand and audit the
decisions made by the models.

MLSysOps effectively manages operations by leveraging telemetry data collected from each level, which provides essential
insights. This data, combined with machine learning models, enhances the decision-making process, aligning with both the
application's objectives and the system's requirements. Actions based on these decisions are cascaded and refined from
the top level downwards. The final status and outcomes of these decisions are then made accessible to system Actors. The
design and functionality of the telemetry system are further explained in [Telemetry system design](telemetry.md).
21 changes: 21 additions & 0 deletions docs/design/controllers.md
@@ -0,0 +1,21 @@
Controllers are responsible for coordinating all internal components of the framework, including the MAPE tasks, SPADE,
Policy and Mechanism Plugins, and the Northbound and Southbound APIs.

- **Application Controller**: Manages the lifecycle of the Analyze loop for each application submitted to the system. When a
new application is submitted, a corresponding Analyze behaviour is initiated, and it is terminated when the application
is removed.

- **Policy & Mechanism Plugin Controllers**: Responsible for loading, initializing, and configuring policy and mechanism
plugins. During runtime, these controllers provide updated information to the Application Controller, reflecting any
changes in the policy API files.

- **Agent Configuration Controller**: Handles external configuration commands received from other agents or via the Northbound
API, and propagates them to the appropriate internal components. It is also responsible for loading the initial
configuration file during startup.

- **Telemetry Controller**: Manages the OpenTelemetry Collector for each agent, including initial deployment and runtime
configuration. Since each collector operates as a pod within the cluster, the Node Agent coordinates with the Cluster
Agent to request deployment and updates, as depicted in Figure 41. Additionally, this controller configures the Monitor
task based on the telemetry metrics being collected.

![otel_deploy_sequence.png](../assets/img/otel_deploy_sequence.png)
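The per-application lifecycle managed by the Application Controller can be sketched with `asyncio` tasks standing in for SPADE behaviours; the class and method names are illustrative assumptions, not the framework's implementation:

```python
# Sketch of an Application Controller that starts one Analyze loop per
# submitted application and cancels it on removal. Names and structure
# are illustrative assumptions.
import asyncio

class ApplicationController:
    def __init__(self):
        self._loops = {}  # app id -> running Analyze task

    async def _analyze_loop(self, app_id: str):
        while True:                    # one Analyze behaviour per app
            await asyncio.sleep(0.01)  # placeholder for analysis work

    def on_app_submitted(self, app_id: str):
        # Initiate the corresponding Analyze behaviour.
        self._loops[app_id] = asyncio.ensure_future(self._analyze_loop(app_id))

    def on_app_removed(self, app_id: str):
        # Terminate the behaviour when the application is removed.
        self._loops.pop(app_id).cancel()

async def demo():
    ctrl = ApplicationController()
    ctrl.on_app_submitted("app-1")
    await asyncio.sleep(0.05)
    ctrl.on_app_removed("app-1")
    return list(ctrl._loops)

remaining = asyncio.run(demo())  # no loops left after removal
```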
Empty file modified docs/design/index.md
100644 → 100755
Empty file.