From 084fa57a62e533132df0d41fa1ff8dd7f23811d1 Mon Sep 17 00:00:00 2001
From: zhou-haitao <1300182097@qq.com>
Date: Wed, 3 Dec 2025 17:34:25 +0800
Subject: [PATCH 1/4] fix docs

---
 .../getting-started/installation_npu.md    |  6 +--
 docs/source/getting-started/quick_start.md | 46 ++++++++++++++----
 .../user-guide/prefix-cache/nfs_store.md   | 48 +++++++++++++++----
 3 files changed, 79 insertions(+), 21 deletions(-)

diff --git a/docs/source/getting-started/installation_npu.md b/docs/source/getting-started/installation_npu.md
index 571e96e15..f59109895 100644
--- a/docs/source/getting-started/installation_npu.md
+++ b/docs/source/getting-started/installation_npu.md
@@ -6,7 +6,7 @@ This document describes how to install unified-cache-management when using Ascen
 - Python: >= 3.9, < 3.12
 - A hardware with Ascend NPU. It’s usually the Atlas 800 A2 series.
 
-The current version of unified-cache-management based on vLLM-Ascend v0.9.2rc1, refer to [vLLM-Ascend Installation Requirements](https://vllm-ascend.readthedocs.io/en/latest/installation.html#requirements) to meet the requirements.
+The current version of unified-cache-management is based on vLLM-Ascend v0.11.0rc1 and v0.9.1; refer to [vLLM-Ascend Installation Requirements](https://vllm-ascend.readthedocs.io/en/latest/installation.html#requirements) to meet the requirements.
 
 You have 2 ways to install for now:
 - Setup from code: First, prepare vLLM-Ascend environment, then install unified-cache-management from source code.
@@ -17,14 +17,14 @@ You have 2 ways to install for now:
 ### Prepare vLLM-Ascend Environment
 For the sake of environment isolation and simplicity, we recommend preparing the vLLM-Ascend environment by pulling the official, pre-built vLLM-Ascend Docker image.
 ```bash
-docker pull quay.io/ascend/vllm-ascend:v0.9.2rc1
+docker pull quay.io/ascend/vllm-ascend:v0.9.1
 ```
 Use the following command to run your own container:
 ```bash
 # Update DEVICE according to your device (/dev/davinci[0-7])
 export DEVICE=/dev/davinci7
 # Update the vllm-ascend image
-export IMAGE=quay.io/ascend/vllm-ascend:v0.9.2rc1
+export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1
 docker run --rm \
   --name vllm-ascend-env \
   --device $DEVICE \
diff --git a/docs/source/getting-started/quick_start.md b/docs/source/getting-started/quick_start.md
index 098c2eeb7..ed8a4c361 100644
--- a/docs/source/getting-started/quick_start.md
+++ b/docs/source/getting-started/quick_start.md
@@ -54,19 +54,47 @@ python offline_inference.py
 
 For online inference , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.
 
-First, specify the python hash seed by:
-```bash
-export PYTHONHASHSEED=123456
-```
+
 Create a config yaml like following and save it to your own directory:
 ```yaml
 # UCM Configuration File Example
-# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
-ucm_connector_name: "UcmNfsStore"
-
-ucm_connector_config:
-  storage_backends: "/mnt/test"
+#
+# This file demonstrates how to configure UCM using YAML.
+# You can use this config file by setting its path in kv_connector_extra_config in your launch script or on the command line, like this:
+# kv_connector_extra_config={"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
+#
+# Alternatively, you can still use kv_connector_extra_config in KVTransferConfig
+# for backward compatibility.
+
+# Connector name (e.g., "UcmNfsStore", "UcmDramStore")
+ucm_connectors:
+  - ucm_connector_name: "UcmNfsStore"
+    ucm_connector_config:
+      storage_backends: "/mnt/test"
+      use_direct: false
+
+load_only_first_rank: false
+
+# Enable UCM metrics so they can be monitored online via Grafana and Prometheus.
+# metrics_config_path: "/workspace/unified-cache-management/examples/metrics/metrics_configs.yaml"
+
+# Sparse attention configuration
+# Format 1: Dictionary format (for methods like ESA, KvComp)
+# ucm_sparse_config:
+#   ESA:
+#     init_window_sz: 1
+#     local_window_sz: 2
+#     min_blocks: 4
+#     sparse_ratio: 0.3
+#     retrieval_stride: 5
+#   # Or for GSA:
+#   GSA: {}
+
+
+# Whether to use layerwise loading/saving (optional, default: True for UnifiedCacheConnectorV1)
+# use_layerwise: true
+# hit_ratio: 0.9
 ```
 
 Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model and your config file path:
diff --git a/docs/source/user-guide/prefix-cache/nfs_store.md b/docs/source/user-guide/prefix-cache/nfs_store.md
index 741fcedf7..dd7d36fe9 100644
--- a/docs/source/user-guide/prefix-cache/nfs_store.md
+++ b/docs/source/user-guide/prefix-cache/nfs_store.md
@@ -90,12 +90,44 @@ To use the NFS connector, you need to configure the `connector_config` dictionar
 Create a config yaml like following and save it to your own directory:
 ```yaml
 # UCM Configuration File Example
-# Refer to file unified-cache-management/examples/ucm_config_example.yaml for more details
-ucm_connector_name: "UcmNfsStore"
+#
+# This file demonstrates how to configure UCM using YAML.
+# You can use this config file by setting its path in kv_connector_extra_config in your launch script or on the command line, like this:
+# kv_connector_extra_config={"UCM_CONFIG_FILE": "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
+#
+# Alternatively, you can still use kv_connector_extra_config in KVTransferConfig
+# for backward compatibility.
+
+# Connector name (e.g., "UcmNfsStore", "UcmDramStore")
+ucm_connectors:
+  - ucm_connector_name: "UcmNfsStore"
+    ucm_connector_config:
+      storage_backends: "/mnt/test"
+      use_direct: false
+
+load_only_first_rank: false
+
+# Enable UCM metrics so they can be monitored online via Grafana and Prometheus.
+# metrics_config_path: "/workspace/unified-cache-management/examples/metrics/metrics_configs.yaml"
+
+# Sparse attention configuration
+# Format 1: Dictionary format (for methods like ESA, KvComp)
+# ucm_sparse_config:
+#   ESA:
+#     init_window_sz: 1
+#     local_window_sz: 2
+#     min_blocks: 4
+#     sparse_ratio: 0.3
+#     retrieval_stride: 5
+#   # Or for GSA:
+#   GSA: {}
+
+
+# Whether to use layerwise loading/saving (optional, default: True for UnifiedCacheConnectorV1)
+# use_layerwise: true
+# hit_ratio: 0.9
+
-ucm_connector_config:
-  storage_backends: "/mnt/test"
-  transferStreamNumber: 32
 ```
 
 ## Launching Inference
@@ -116,7 +148,6 @@ Then run the script as follows:
 ```bash
 cd examples/
-export PYTHONHASHSEED=123456
 python offline_inference.py
 ```
@@ -166,10 +197,9 @@ curl http://localhost:7800/v1/completions \
 ```
 
 To quickly experience the NFS Connector's effect:
-1. Start the service with:
-   `--no-enable-prefix-caching`
+1. Start the service with: `--no-enable-prefix-caching`
 2. Send the same request (exceed 128 tokens) twice consecutively
-3. Remember to enable prefix caching (do not add `--no-enable-prefix-caching`) in production environments.
+
 ### Log Message Structure
 ```text
 [UCMNFSSTORE] [I] Task(,,,) finished, elapsed
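
The comments in the YAML blocks this patch adds describe a precedence rule: a `UCM_CONFIG_FILE` entry in `kv_connector_extra_config` points at a YAML file, while plain inline keys remain supported for backward compatibility. A minimal sketch of that resolution order, under stated assumptions (`resolve_ucm_config` and `CONFIG_KEY` are illustrative names, not UCM's actual API):

```python
# Hypothetical sketch of the config-resolution order described in the
# YAML comments above. Not UCM's real implementation: the function name,
# CONFIG_KEY constant, and return shape are illustrative only.

CONFIG_KEY = "UCM_CONFIG_FILE"

def resolve_ucm_config(extra_config: dict) -> dict:
    """Return the effective UCM config source for a kv_connector_extra_config dict."""
    if CONFIG_KEY in extra_config:
        # File-based form: the YAML file at this path carries the full config.
        return {"source": "yaml", "path": extra_config[CONFIG_KEY]}
    # Backward-compatible form: the inline dict itself is the config.
    return {"source": "inline", "config": extra_config}

# File-based configuration, as in the launch examples in this patch.
file_based = resolve_ucm_config(
    {CONFIG_KEY: "/workspace/unified-cache-management/examples/ucm_config_example.yaml"}
)

# Inline configuration, the backward-compatible form.
inline = resolve_ucm_config(
    {"ucm_connector_name": "UcmNfsStore",
     "ucm_connector_config": {"storage_backends": "/mnt/test"}}
)
```

Under this reading, a deployment switches between the two forms by adding or removing the single `UCM_CONFIG_FILE` key, leaving the rest of the launch command unchanged.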