{AKS} feat(aks-agent): support cluster and client mode #9522
Conversation
| rule | cmd_name | rule_message | suggest_message |
|---|---|---|---|
| | aks agent | cmd aks agent added parameter mode | |
| | aks agent | cmd aks agent added parameter namespace | |
| | aks agent-cleanup | cmd aks agent-cleanup added parameter mode | |
| | aks agent-cleanup | cmd aks agent-cleanup added parameter namespace | |

Thank you for your contribution! We will review the pull request and get back to you soon.
The git hooks are available for the azure-cli and azure-cli-extensions repos. They can help you run required checks before creating the PR. Please sync the latest code with the latest dev branch (for azure-cli) or main branch (for azure-cli-extensions): `pip install azdev --upgrade`, then `azdev setup -c <your azure-cli repo path> -r <your azure-cli-extensions repo path>`.
Force-pushed from 2037264 to 3bee7fe.
Pull request overview
This pull request adds support for two deployment modes to the AKS agent extension: cluster mode (existing Helm-based deployment) and local mode (new Docker container-based local execution). The changes introduce mode selection during initialization, refactor the agent manager classes using an abstract base class, and add comprehensive telemetry tracking for different event types.
Changes:
- Added interactive mode selection during `az aks agent-init`, allowing users to choose between cluster mode and local mode
- Introduced the `AKSAgentManagerLocal` class for Docker-based local execution with local file-based configuration storage
- Refactored code using the abstract base class `AKSAgentManagerLLMConfigBase` to support both deployment modes
- Modified namespace handling to require explicit namespace specification for cluster mode operations
- Enhanced telemetry to track init, cleanup, and startup events with mode information
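The refactor described above can be pictured as a small class hierarchy. The following is an illustrative sketch only: the class and method names are taken from this review (`AKSAgentManagerLLMConfigBase`, `AKSAgentManagerLocal`, and the methods shown in the code below), but the bodies are invented stand-ins, not the extension's actual implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict


class AKSAgentManagerLLMConfigBase(ABC):
    """Interface shared by the cluster-mode and local-mode managers (sketch)."""

    @abstractmethod
    def save_llm_config(self, provider, params: dict) -> None:
        """Persist the chosen LLM provider configuration."""

    @abstractmethod
    def get_llm_config(self) -> Dict:
        """Return the stored LLM configuration, or an empty dict."""

    @abstractmethod
    def exec_aks_agent(self, command_flags: str = "") -> bool:
        """Run the agent (Helm-deployed pod vs. local `docker run`)."""

    @abstractmethod
    def uninstall_agent(self) -> bool:
        """Tear down the deployment or remove local config files."""


class AKSAgentManagerLocal(AKSAgentManagerLLMConfigBase):
    """Docker-based local mode, with trivial in-memory stand-in bodies."""

    def __init__(self) -> None:
        self._config: Dict = {}

    def save_llm_config(self, provider, params: dict) -> None:
        self._config = {"provider": provider, **params}

    def get_llm_config(self) -> Dict:
        return self._config

    def exec_aks_agent(self, command_flags: str = "") -> bool:
        return True  # the real implementation shells out to `docker run`

    def uninstall_agent(self) -> bool:
        self._config = {}
        return True
```

The cluster-mode `AKSAgentManager` would subclass the same base, so callers in `custom.py` can dispatch on the selected mode without branching on implementation details.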
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| setup.py | Version bump from 1.0.0b14 to 1.0.0b15 |
| custom.py | Major refactoring to support mode selection, separate LLM config and helm deployment setup, add local mode status checking |
| aks_agent_manager.py | Introduced abstract base class, refactored AKSAgentManager with breaking constructor changes, added AKSAgentManagerLocal class for Docker execution |
| llm_config_manager.py | Added LLMConfigManagerLocal class for file-based LLM configuration management |
| telemetry.py | Enhanced to track different event types (init, cleanup, startup) and mode information |
| _params.py | Added namespace and mode parameters for agent and cleanup commands |
| _help.py | Updated help text to document new mode parameter and local mode requirements |
| _consts.py | Added CONFIG_DIR constant for local configuration storage |
| aks.py | Improved kubelogin command to explicitly specify kubeconfig path |
| __init__.py | Exported AKSAgentManagerLocal class |
| test_aks_agent_manager.py | Removed deprecated test for set_aks_context method |
| README.rst | Updated documentation to explain both deployment modes |
| HISTORY.rst | Added release notes for version 1.0.0b15 |
Comments suppressed due to low confidence (1)
src/aks-agent/azext_aks_agent/agent/k8s/aks_agent_manager.py:1161
- The newly introduced AKSAgentManagerLocal class (lines 905-1161) lacks test coverage. Given that this file has comprehensive test coverage for AKSAgentManager, similar tests should be added for the local mode implementation to verify Docker execution, configuration file handling, and error scenarios.
```python
        # Create cluster-specific config directory to match LLMConfigManagerLocal
        self.config_dir = self.base_config_dir / subscription_id / resource_group_name / cluster_name

        # Docker image for client mode execution
        # TODO(mainred): update the docker image to the official one when available
        # self.docker_image = "mcr.microsoft.com/aks/aks-agent:89bde40-local"
        self.docker_image = "mainred/aks-agent:89bde40-local"

        self.llm_config_manager = LLMConfigManagerLocal(
            subscription_id=subscription_id,
            resource_group_name=resource_group_name,
            cluster_name=cluster_name
        )

        # Ensure custom_toolset.yaml exists
        self._ensure_custom_toolset()

    def _ensure_custom_toolset(self) -> None:
        """
        Ensure custom_toolset.yaml exists in the config directory.

        Creates the file with default MCP server configuration if it doesn't exist.
        """
        import yaml

        custom_toolset_file = self.config_dir / "custom_toolset.yaml"

        # Create config directory if it doesn't exist
        self.config_dir.mkdir(parents=True, exist_ok=True)

        # Only create if file doesn't exist
        if not custom_toolset_file.exists():
            default_config = {
                "mcp_servers": {
                    "aks-mcp": {
                        "description": "Azure MCP server exposes the Azure and Kubernetes capabilities for Azure Kubernetes Service clusters",
                        "config": {
                            "url": "http://localhost:8000/sse",
                        }
                    }
                }
            }
            try:
                with open(custom_toolset_file, 'w', encoding='utf-8') as f:
                    yaml.dump(default_config, f, default_flow_style=False, sort_keys=False)
                logger.debug("Created custom_toolset.yaml at: %s", custom_toolset_file)
            except Exception as e:
                logger.warning("Failed to create custom_toolset.yaml: %s", e)
        else:
            logger.debug("custom_toolset.yaml already exists at: %s", custom_toolset_file)

    def save_llm_config(self, provider: LLMProvider, params: dict) -> None:
        """
        Save LLM configuration using the LLMConfigManager.

        Args:
            provider: LLMProvider instance
            params: Dictionary of model parameters
        """
        self.llm_config_manager.save(provider, params)

    def get_llm_config(self) -> Dict:
        """
        Get LLM configuration from local file.

        Returns:
            Dictionary of model configurations if exists, empty dict otherwise
        """
        return self.llm_config_manager.model_list if self.llm_config_manager.model_list else {}

    def exec_aks_agent(self, command_flags: str = "") -> bool:
        """
        Execute commands on the AKS agent using Docker container.

        This method runs the AKS agent in a Docker container with the local
        LLM configuration and kubeconfig mounted.

        Args:
            command_flags: Additional flags for the aks-agent command

        Returns:
            True if execution was successful

        Raises:
            AzCLIError: If execution fails or Docker is not available
        """
        import shlex
        import subprocess
        import sys

        logger.info("Executing AKS agent command in Docker container with flags: %s", command_flags)
        try:
            # Check if configuration exists
            model_list_file = self.config_dir / "model_list.yaml"
            custom_toolset_file = self.config_dir / "custom_toolset.yaml"
            if not self.config_dir.exists() or not model_list_file.exists() or not custom_toolset_file.exists():
                raise AzCLIError(
                    "AKS agent configuration not found.\n"
                    "Please run 'az aks agent-init' first to initialize the agent."
                )

            # Check if Docker is available
            try:
                subprocess.run(
                    ["docker", "--version"],
                    check=True,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.PIPE
                )
            except (subprocess.CalledProcessError, FileNotFoundError):
                raise AzCLIError(
                    "Docker is not available. Please install Docker to use client mode.\n"
                    "Visit https://docs.docker.com/get-docker/ for installation instructions."
                )

            # Build volume mounts
            volumes = []
            if self.kubeconfig_path:
                volumes.extend(["-v", f"{self.kubeconfig_path}:/root/.kube/config:ro"])

            # Mount Azure config directory
            azure_config_dir = os.path.expanduser("~/.azure")
            if os.path.exists(azure_config_dir):
                volumes.extend(["-v", f"{azure_config_dir}:/root/.azure:ro"])
                logger.debug("Mounting Azure config directory: %s", azure_config_dir)
            else:
                logger.debug("Azure config directory not found, skipping mount: %s", azure_config_dir)

            # Mount LLM config files
            volumes.extend(["-v", f"{model_list_file}:/etc/aks-agent/config/model_list.yaml:ro"])
            volumes.extend(["-v", f"{custom_toolset_file}:/etc/aks-agent/config/custom_toolset.yaml:ro"])

            # Build environment variables for AKS context
            env_vars = [
                "-e", f"AKS_RESOURCE_GROUP={self.resource_group_name}",
                "-e", f"AKS_CLUSTER_NAME={self.cluster_name}",
                "-e", f"AKS_SUBSCRIPTION_ID={self.subscription_id}"
            ]

            # Prepare the command
            exec_command = [
                "docker", "run",
                "--rm",
                "-it",
                *volumes,
                *env_vars,
                self.docker_image,
                "ask"
            ]

            # Add command flags if provided; they may arrive as one quoted string with multiple args
            if command_flags:
                exec_command.extend(shlex.split(command_flags))

            logger.debug("Running Docker command: %s", " ".join(exec_command))

            # Execute the Docker container with interactive TTY
            result = subprocess.run(
                exec_command,
                stdin=sys.stdin,
                stdout=sys.stdout,
                stderr=sys.stderr
            )
            if result.returncode != 0:
                raise AzCLIError(f"Docker container exited with code {result.returncode}")

            logger.info("AKS agent command executed successfully in Docker container")
            return True
        except AzCLIError:
            raise
        except subprocess.CalledProcessError as e:
            logger.error("Failed to execute Docker command: %s", e)
            raise AzCLIError(f"Failed to execute AKS agent in Docker: {e}")
        except Exception as e:
            logger.error("Failed to execute AKS agent command: %s", e)
            raise

    def uninstall_agent(self) -> bool:
        """
        Uninstall AKS agent by removing local LLM configuration files.

        Returns:
            True if uninstallation was successful
        """
        logger.info("Removing local AKS agent configuration")
        try:
            config_files = [
                self.config_dir / "model_list.yaml",
                self.config_dir / "custom_toolset.yaml"
            ]
            removed_files = []
            for config_file in config_files:
                if config_file.exists():
                    config_file.unlink()
                    removed_files.append(str(config_file))
                    logger.debug("Removed config file: %s", config_file)

            if removed_files:
                logger.info("Successfully removed %d configuration file(s)", len(removed_files))
            else:
                logger.info("No configuration files found to remove")

            # Remove the config directory if it's empty
            if self.config_dir.exists() and not any(self.config_dir.iterdir()):
                self.config_dir.rmdir()
                logger.debug("Removed empty config directory: %s", self.config_dir)

            return True
        except Exception as e:
            logger.error("Failed to remove local configuration: %s", e)
            raise AzCLIError(f"Failed to uninstall local agent configuration: {e}")
```
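The suppressed review comment above notes that `AKSAgentManagerLocal` lacks test coverage. One way to make the Docker path testable without a Docker daemon is to factor the command assembly into a pure function. The sketch below is a hypothetical standalone reproduction (the helper name `build_docker_command` is invented here), mirroring the mounts, environment variables, and flag parsing in `exec_aks_agent` so the resulting argv can be asserted against directly:

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_docker_command(
    docker_image: str,
    config_dir: Path,
    resource_group: str,
    cluster_name: str,
    subscription_id: str,
    kubeconfig_path: Optional[str] = None,
    command_flags: str = "",
) -> List[str]:
    """Assemble the `docker run` argv the same way exec_aks_agent does."""
    volumes: List[str] = []
    if kubeconfig_path:
        volumes += ["-v", f"{kubeconfig_path}:/root/.kube/config:ro"]
    # LLM config files are mounted read-only into the container's config path
    volumes += ["-v", f"{config_dir / 'model_list.yaml'}:/etc/aks-agent/config/model_list.yaml:ro"]
    volumes += ["-v", f"{config_dir / 'custom_toolset.yaml'}:/etc/aks-agent/config/custom_toolset.yaml:ro"]
    env_vars = [
        "-e", f"AKS_RESOURCE_GROUP={resource_group}",
        "-e", f"AKS_CLUSTER_NAME={cluster_name}",
        "-e", f"AKS_SUBSCRIPTION_ID={subscription_id}",
    ]
    cmd = ["docker", "run", "--rm", "-it", *volumes, *env_vars, docker_image, "ask"]
    if command_flags:
        cmd += shlex.split(command_flags)  # flags may arrive as one quoted string
    return cmd
```

A unit test can then verify the argv shape (flag splitting, mount targets, env var values) with no subprocess call at all, which addresses the reviewer's "Docker execution" coverage point for the command-construction half of the method.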
```python
self.cluster_name: str = cluster_name
self.subscription_id: str = subscription_id
# ...
self.chart_repo = "oci://mcr.microsoft.com/aks/aks-agent-chart/aks-agent"
```
Need to use the official one
/azp run

Azure Pipelines successfully started running 2 pipeline(s).

@yanzhudd could you please help merge this PR? Thanks.

[Release] Update index.json for extension [ aks-agent-1.0.0b15 ]: https://dev.azure.com/msazure/One/_build/results?buildId=149894816&view=results
This checklist is used to make sure that common guidelines for a pull request are followed.

Related command

General Guidelines

- Have you run `azdev style <YOUR_EXT>` locally? (`pip install azdev` required)
- Have you run `python scripts/ci/test_index.py -q` locally? (`pip install wheel==0.30.0` required)

For new extensions:

About Extension Publish

There is a pipeline to automatically build, upload and publish extension wheels. Once your pull request is merged into the main branch, a new pull request will be created to update `src/index.json` automatically. You only need to update the version information in `setup.py` and the historical information in `HISTORY.rst` in your PR, but do not modify `src/index.json`.
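Per the file table in the review above, the version-related edits in this PR are the bump in setup.py and a matching HISTORY.rst entry. A minimal sketch of what that typically looks like in an azure-cli-extensions package (the exact surrounding contents of both files, and the HISTORY.rst wording, are assumptions here):

```python
# setup.py (excerpt, sketch)
VERSION = "1.0.0b15"  # bumped from 1.0.0b14 in this PR

# HISTORY.rst gains a matching entry at the top, along the lines of:
#
#   1.0.0b15
#   ++++++++
#   * Support cluster and local deployment modes for the AKS agent.
```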