diff --git a/documentation/3.kernel/INDEX.md b/documentation/3.kernel/INDEX.md
index 59cd6458464..3c874cd77ea 100644
--- a/documentation/3.kernel/INDEX.md
+++ b/documentation/3.kernel/INDEX.md
@@ -9,3 +9,4 @@
- @subpage page_memory_management
- @subpage page_interrupt_management
- @subpage page_kernel_porting
+- @subpage page_kernel_smp_boot
diff --git a/documentation/3.kernel/smp-startup/README.md b/documentation/3.kernel/smp-startup/README.md
new file mode 100644
index 00000000000..be080346fee
--- /dev/null
+++ b/documentation/3.kernel/smp-startup/README.md
@@ -0,0 +1,136 @@
+@page page_kernel_smp_boot QEMU virt64 AArch64 SMP Boot Flow
+
+# QEMU virt64 AArch64 SMP Boot Flow
+
+This guide walks through the multi-core boot path of RT-Thread on AArch64 using `bsp/qemu-virt64-aarch64` as the concrete reference. It is written to be beginner-friendly and mirrors the current BSP implementation: from `_start` assembly, early MMU bring-up, `rtthread_startup()`, PSCI wakeup of secondary cores, to all CPUs entering the scheduler. The original PlantUML diagram is replaced by Mermaid so it renders directly on GitHub.
+
+- Target setup: QEMU `-machine virt`, `-cpu cortex-a57`, `-smp >=2`, `RT_USING_SMP` enabled, device tree contains `enable-method = "psci"`.
+- Goal: Know who does what, where the code lives, and what to check when SMP does not come up.
+
+## Big Picture First
+
+```mermaid
+flowchart TD
+ ROM[BootROM/BL1
QEMU firmware] --> START["_start
(entry_point.S)"]
+ START --> MMU["init_mmu_early
enable_mmu_early"]
+ MMU --> CBOOT[rtthread_startup()]
+ CBOOT --> BOARD["rt_hw_board_init
-> rt_hw_common_setup"]
+ BOARD --> MAIN[main_thread_entry]
+ MAIN --> PSCI["rt_hw_secondary_cpu_up
(PSCI CPU_ON)"]
+ PSCI --> SECASM["_secondary_cpu_entry
(ASM)"]
+ SECASM --> SECC[rt_hw_secondary_cpu_bsp_start]
+ SECC --> SCHED[rt_system_scheduler_start]
+ SCHED --> RUN[SMP scheduling]
+```
+
+## Boot CPU: from `_start` to MMU on
+
+**Input registers**: QEMU firmware loads the image and jumps to `_start` at `libcpu/aarch64/cortex-a/entry_point.S`, passing the DTB physical address in `x0` (with `x1~x3` reserved).
+
+**What `_start` does (short version)**
+
+1. Clear thread pointers: zero `tpidr_el1/tpidrro_el0` to avoid stale per-cpu state.
+2. Unify exception level: `init_cpu_el` drops to EL1h, enables timer access, masks unwanted traps.
+3. Clear BSS: `init_kernel_bss` fills `__bss` with zeros so globals start clean.
+4. Prepare stack: `init_cpu_stack_early` switches to SP_EL1 and uses `.boot_cpu_stack_top` as the early stack.
+5. Remember the FDT: `rt_hw_fdt_install_early(x0)` stores DTB address/size before MMU is enabled.
+6. Early MMU mapping: `init_mmu_early`/`enable_mmu_early` build a 0~1G identity map, set TTBR0/TTBR1 and SCTLR_EL1, flush I/D cache and TLB, then branch to `rtthread_startup()` (address in x8).
+
+> Tip: the early page table only covers minimal kernel space; the C phase will remap a fuller layout.
+
+## C-side startup backbone
+
+`rtthread_startup()` (in `src/components.c`) is the spine of the sequence:
+
+- **Interrupts off + spinlock ready**: `rt_hw_local_irq_disable()` followed by `_cpus_lock` init to keep early steps non-preemptible.
+- **Board init**: `rt_hw_board_init()` directly calls the BSP hook `rt_hw_common_setup()` (`libcpu/aarch64/common/setup.c`) to:
+ - set VBAR, build kernel address space, copy DTB to a safe region and pre-parse it;
+ - configure MMU mappings; init memblock/page allocator/system heap;
+ - parse DT for console, memory, initrd;
+ - init GIC (and GICv3 Redistributor if enabled), UART, global GTIMER;
+ - install SMP IPIs (`RT_SCHEDULE_IPI`, `RT_STOP_IPI`, `RT_SMP_CALL_IPI`) and unmask them;
+ - set idle hook `rt_hw_idle_wfi` so idle CPUs enter low-power wait.
+- **Kernel subsystems**: init system timer, scheduler, signals, and create main/timer/idle threads.
+- **Start scheduling**: `rt_system_scheduler_start()` runs `main_thread_entry()` first.
+
+## How secondary cores are brought up
+
+`main_thread_entry()` calls `rt_hw_secondary_cpu_up()` before invoking user `main()`, so all CPUs join scheduling.
+
+### What `rt_hw_secondary_cpu_up()` does
+
+1. Convert `_secondary_cpu_entry` to a physical address via `rt_kmem_v2p()`—the real entry the firmware jumps to.
+2. Walk CPU nodes recorded at boot (`cpu_info_init()` stored DTB info in `cpu_np[]` and `rt_cpu_mpidr_table[]`).
+3. Read `enable-method`:
+ - QEMU virt64: `"psci"` → use `cpu_psci_ops.cpu_boot()` to issue `CPU_ON(target, entry)` to firmware.
+ - Legacy compatibility: `"spin-table"` → write `cpu-release-addr` and `sev` to wake.
+4. Any failure prints a warning but does not halt the boot flow, making diagnosis easier.
+
+### What happens on a secondary core
+
+- **Assembly entry `_secondary_cpu_entry`**:
+ - Read `mpidr_el1`, compare with `rt_cpu_mpidr_table` to find the logical CPU id, store it back, and write it into `TPIDR` for per-cpu access.
+ - Allocate its own stack by offsetting `ARCH_SECONDARY_CPU_STACK_SIZE` per core.
+ - Re-run `init_cpu_el`/`init_cpu_stack_early`, reuse the same early MMU path, then branch to `rt_hw_secondary_cpu_bsp_start()`.
+
+- **C-side handoff `rt_hw_secondary_cpu_bsp_start()`** (`libcpu/aarch64/common/setup.c`):
+ - Reset VBAR and synchronize with the boot CPU via `_cpus_lock`.
+ - Update this core's MPIDR entry and bind the shared `MMUTable`.
+ - Init local vector table, GIC CPU interface (and GICv3 Redistributor if present), enable the local GTIMER.
+ - Unmask the three SMP IPIs; re-calibrate `loops_per_tick` for microsecond delay if needed.
+ - Call `rt_dm_secondary_cpu_init()` to register the CPU device, then enter the scheduler via `rt_system_scheduler_start()`.
+
+### Timeline (Mermaid)
+
+```mermaid
+sequenceDiagram
+ participant ROM as BootROM/BL1
+ participant START as _start (ASM)
+ participant CBOOT as rtthread_startup
+ participant MAIN as main_thread_entry
+ participant FW as PSCI firmware
+ participant SECASM as _secondary_cpu_entry
+ participant SECC as rt_hw_secondary_cpu_bsp_start
+ participant SCHED as Scheduler (all CPUs)
+
+ ROM->>START: x0=DTB, jump to _start
+ START->>START: init_cpu_el / clear BSS / set stack
+ START->>START: init_mmu_early + enable_mmu_early
+ START-->>CBOOT: branch to rtthread_startup()
+ CBOOT->>CBOOT: rt_hw_board_init -> rt_hw_common_setup
+ CBOOT-->>SCHED: rt_system_scheduler_start()
+ SCHED-->>MAIN: run main_thread_entry
+ MAIN->>FW: rt_hw_secondary_cpu_up (CPU_ON)
+ FW-->>SECASM: entry = _secondary_cpu_entry
+ SECASM->>SECASM: stack/TPIDR/EL setup
+ SECASM-->>SECC: enable_mmu_early -> rt_hw_secondary_cpu_bsp_start
+ SECC->>SECC: local GIC/Timer/IPI init
+ SECC-->>SCHED: rt_system_scheduler_start()
+ SCHED-->>MAIN: continue main()
+ SCHED-->>Others: SMP scheduling
+```
+
+## Source map (where to read the code)
+
+| Stage | File | Role |
+| --- | --- | --- |
+| Boot assembly | `libcpu/aarch64/cortex-a/entry_point.S` | `_start`, `_secondary_cpu_entry`, early MMU enable |
+| BSP hook | `bsp/qemu-virt64-aarch64/drivers/board.c` | Wires `rt_hw_board_init()` to `rt_hw_common_setup()` |
+| Memory/GIC/IPI init | `libcpu/aarch64/common/setup.c` | `rt_hw_common_setup()`, `rt_hw_secondary_cpu_up()`, `rt_hw_secondary_cpu_bsp_start()` |
+| C entry skeleton | `src/components.c` | `rtthread_startup()`, `main_thread_entry()` |
+
+## Quick checks when SMP fails to come up
+
+- Device tree: contains `enable-method = "psci"` and QEMU is started with `-machine virt` (PSCI firmware included).
+- `_secondary_cpu_entry` physical address: `rt_kmem_v2p()` must not return 0, otherwise a check fails.
+- Init order: GIC/Timer must be ready before calling `rt_hw_secondary_cpu_up()`; if you fork a custom BSP, do these first.
+- UART logs: look for `Call cpu X on success/failed`; add extra prints in `_secondary_cpu_entry` if needed, and use QEMU `-d cpu_reset -smp N` to debug.
+
+## AArch64 pocket notes (just enough)
+
+- **Exception levels**: startup may be at EL3/EL2; `init_cpu_el` descends to EL1h where the kernel runs.
+- **Two stack pointers**: `spsel #1` selects `SP_EL1` so user mode cannot touch the kernel stack.
+- **MMU bring-up order**: build page tables → configure TCR/TTBR → flush cache/TLB → set `SCTLR_EL1.M/C/I` → `isb`.
+- **MPIDR**: unique core affinity; stored in `rt_cpu_mpidr_table[]` to map logical CPU ids and IPI targets.
+
+With these in place, the QEMU virt64 AArch64 BSP SMP path is clear: the boot CPU prepares memory and shared peripherals, `main_thread_entry()` issues PSCI wakeups, secondary cores land with the same MMU/EL setup, and all CPUs join the scheduler.
diff --git a/documentation/3.kernel/smp-startup/README_zh.md b/documentation/3.kernel/smp-startup/README_zh.md
new file mode 100644
index 00000000000..2c537cff083
--- /dev/null
+++ b/documentation/3.kernel/smp-startup/README_zh.md
@@ -0,0 +1,136 @@
+@page page_kernel_smp_boot_zh QEMU virt64 AArch64 多核启动流程(中文)
+
+# QEMU virt64 AArch64 多核启动流程
+
+本文以 `bsp/qemu-virt64-aarch64` 为例,对 RT-Thread 在 AArch64 平台上的多核启动做一份“初学者友好”的拆解,覆盖从 `_start` 汇编、MMU 打开、`rtthread_startup()`,到 PSCI 唤醒次级核并全部进入调度器的完整链路。全文基于当前 BSP 的真实实现,顺手补全一些容易忽略的细节,并把原有 PlantUML 图改成可在 GitHub 直接渲染的 Mermaid。
+
+- 适用环境:QEMU `-machine virt`、`-cpu cortex-a57`、`-smp >=2`,`RT_USING_SMP` 已开启,设备树包含 `enable-method = "psci"`。
+- 读完你将能:看懂每一步是谁做的、代码在哪、如果多核没起来要检查什么。
+
+## 全局先看一眼
+
+```mermaid
+flowchart TD
+ ROM[BootROM/BL1
QEMU 固件] --> START["_start
(entry_point.S)"]
+ START --> MMU["init_mmu_early
enable_mmu_early"]
+ MMU --> CBOOT[rtthread_startup()]
+ CBOOT --> BOARD["rt_hw_board_init
-> rt_hw_common_setup"]
+ BOARD --> MAIN[main_thread_entry]
+ MAIN --> PSCI["rt_hw_secondary_cpu_up
(PSCI CPU_ON)"]
+ PSCI --> SECASM["_secondary_cpu_entry
(ASM)"]
+ SECASM --> SECC[rt_hw_secondary_cpu_bsp_start]
+ SECC --> SCHED[rt_system_scheduler_start]
+ SCHED --> RUN[多核调度运行态]
+```
+
+## Boot CPU:从 `_start` 到 MMU 打开
+
+**输入参数**:QEMU 固件把镜像装入内存,跳到 `libcpu/aarch64/cortex-a/entry_point.S` 的 `_start`,同时 `x0` 带上 DTB 物理地址,`x1~x3` 预留。
+
+**`_start` 做的事(精简版)**
+
+1. 清理线程指针:`tpidr_el1/tpidrro_el0` 置零,避免继承旧状态。
+2. 异常级统一:`init_cpu_el` 把 CPU 拉到 EL1h,打开计时器访问,关掉不必要的陷入。
+3. BSS 清零:`init_kernel_bss` 循环写 0,保证全局变量干净。
+4. 栈准备:`init_cpu_stack_early` 切换到 SP_EL1,并使用链接脚本里的 `.boot_cpu_stack_top` 作为启动栈。
+5. 保存 FDT:`rt_hw_fdt_install_early(x0)` 在 MMU 开启前记录 DTB 起始地址和大小。
+6. MMU 早期映射:`init_mmu_early`/`enable_mmu_early` 建立 0~1G 恒等映射,设置 TTBR0/TTBR1、SCTLR_EL1,清理 I/D Cache 与 TLB,完成后跳转到 `rtthread_startup()`(寄存器 x8)。
+
+> 小贴士:早期页表只够最小内核布局,后面会在 C 里重新映射更完整的空间。
+
+## 进入 C 语言后的启动骨架
+
+`rtthread_startup()`(`src/components.c`)是整条链路的骨干,关键点如下:
+
+- **禁中断 + 自旋锁**:先 `rt_hw_local_irq_disable()`,再初始化 `_cpus_lock`,避免启动阶段被抢占。
+- **板级初始化**:`rt_hw_board_init()` 直接调用 BSP 的 `rt_hw_common_setup()`(`libcpu/aarch64/common/setup.c`),完成:
+ - 设置 VBAR(异常向量)、建立内核地址空间、拷贝 DTB 到安全内存并预解析;
+ - 配置 MMU 映射、初始化 memblock/页分配器/系统堆;
+ - 解析设备树:控制台、内存、initrd;
+ - 初始化 GIC(或 GICv3 Redistributor)、UART、全局 GTIMER;
+ - 安装 SMP IPI:`RT_SCHEDULE_IPI`、`RT_STOP_IPI`、`RT_SMP_CALL_IPI` 并解除屏蔽;
+ - 设置空闲钩子 `rt_hw_idle_wfi`,保证空闲时进入低功耗等待。
+- **内核子系统**:初始化系统定时器、调度器、信号机制,创建 main/定时/空闲线程。
+- **调度器启动**:`rt_system_scheduler_start()` 让 `main_thread_entry()` 首先运行。
+
+## 次级核如何被拉起
+
+`main_thread_entry()` 在调用用户 `main()` 前会执行 `rt_hw_secondary_cpu_up()`,确保所有 CPU 都进调度器。
+
+### `rt_hw_secondary_cpu_up()` 做什么
+
+1. 把 `_secondary_cpu_entry` 转成物理地址(`rt_kmem_v2p()`),这是固件要跳转的真实入口。
+2. 遍历启动时记录的 CPU 节点(`cpu_info_init()` 已把 DTB 信息存进 `cpu_np[]` 和 `rt_cpu_mpidr_table[]`)。
+3. 读取 `enable-method`:
+ - QEMU virt64:`"psci"` → 走 `cpu_psci_ops.cpu_boot()`,向固件发 `CPU_ON(target, entry)`;
+ - 兼容老平台:`"spin-table"` → 写 `cpu-release-addr`,再 `sev` 唤醒。
+4. 任一核失败会打印 Warning,但主核流程不会被中断,便于后续排查。
+
+### 发生在次级核的事
+
+- **汇编入口 `_secondary_cpu_entry`**:
+ - 读取 `mpidr_el1`,和 `rt_cpu_mpidr_table` 比对确认逻辑核号并写回表项,随后将逻辑核号写入 `TPIDR`,便于 per-cpu 访问。
+ - 按 `ARCH_SECONDARY_CPU_STACK_SIZE` 为每个核分配独立栈。
+ - 重复 `init_cpu_el`、`init_cpu_stack_early`,共用同一套早期 MMU 建表逻辑,最后跳到 `rt_hw_secondary_cpu_bsp_start()`。
+
+- **C 侧收尾 `rt_hw_secondary_cpu_bsp_start()`**(`libcpu/aarch64/common/setup.c`):
+ - 重新设置 VBAR,并持有 `_cpus_lock` 与主核同步。
+ - 更新本核的 MPIDR 表项,绑定全局 `MMUTable`。
+ - 初始化本地向量表、GIC CPU 接口(和 GICv3 Redistributor,如果开启)、开启本地 GTIMER。
+ - 解除三种 IPI 屏蔽,必要时重新校准 `loops_per_tick`(us 延时)。
+ - 调用 `rt_dm_secondary_cpu_init()` 注册 CPU 设备,最后 `rt_system_scheduler_start()` 让该核进入调度。
+
+### 时序图(Mermaid)
+
+```mermaid
+sequenceDiagram
+ participant ROM as BootROM/BL1
+ participant START as _start (ASM)
+ participant CBOOT as rtthread_startup
+ participant MAIN as main_thread_entry
+ participant FW as PSCI 固件
+ participant SECASM as _secondary_cpu_entry
+ participant SECC as rt_hw_secondary_cpu_bsp_start
+ participant SCHED as Scheduler(全部CPU)
+
+ ROM->>START: x0=DTB,跳转 _start
+ START->>START: init_cpu_el / 清 BSS / 设栈
+ START->>START: init_mmu_early + enable_mmu_early
+ START-->>CBOOT: 跳到 rtthread_startup()
+ CBOOT->>CBOOT: rt_hw_board_init -> rt_hw_common_setup
+ CBOOT-->>SCHED: rt_system_scheduler_start()
+ SCHED-->>MAIN: 调度 main_thread_entry
+ MAIN->>FW: rt_hw_secondary_cpu_up (CPU_ON)
+ FW-->>SECASM: entry = _secondary_cpu_entry
+ SECASM->>SECASM: 栈/TPIDR/EL 初始化
+ SECASM-->>SECC: enable_mmu_early -> rt_hw_secondary_cpu_bsp_start
+ SECC->>SECC: GIC/Timer/IPI 本地初始化
+ SECC-->>SCHED: rt_system_scheduler_start()
+ SCHED-->>MAIN: 继续 main()
+ SCHED-->>其他线程: 多核调度
+```
+
+## 关键代码位置对照表
+
+| 阶段 | 主要文件 | 作用 |
+| --- | --- | --- |
+| 启动汇编 | `libcpu/aarch64/cortex-a/entry_point.S` | `_start`、`_secondary_cpu_entry`、MMU 早期开启 |
+| BSP 汇聚 | `bsp/qemu-virt64-aarch64/drivers/board.c` | 把 `rt_hw_board_init()` 对接到 `rt_hw_common_setup()` |
+| 内存/GIC/IPI 初始化 | `libcpu/aarch64/common/setup.c` | `rt_hw_common_setup()`、`rt_hw_secondary_cpu_up()`、`rt_hw_secondary_cpu_bsp_start()` |
+| C 入口骨架 | `src/components.c` | `rtthread_startup()`、`main_thread_entry()` |
+
+## 常见检查项(多核没起来时)
+
+- 设备树是否有 `enable-method = "psci"`,且 QEMU 启动带了 `-machine virt`(自带 PSCI 固件)。
+- `_secondary_cpu_entry` 能否正确转成物理地址:`rt_kmem_v2p()` 返回 0 会触发断言。
+- GIC/Timer 是否在主核初始化完成后才去唤核;若自定义 BSP,务必在调用 `rt_hw_secondary_cpu_up()` 前完成中断与定时器初始化。
+- 观察串口日志中的 `Call cpu X on success/failed`,必要时在 `_secondary_cpu_entry` 里加额外打印,结合 `-d cpu_reset -smp N` 排查。
+
+## AArch64 小抄(够用版)
+
+- **异常级**:启动时可能在 EL3/EL2,`init_cpu_el` 会层层降到内核跑的 EL1h。
+- **双栈指针**:`spsel #1` 选用 `SP_EL1`,保证内核栈不被 EL0 访问。
+- **MMU 开启顺序**:写页表 → 配置 TCR/TTBR → 刷 Cache/TLB → 置位 `SCTLR_EL1.M/C/I` → `isb` 生效。
+- **MPIDR**:多核唯一标识,`rt_cpu_mpidr_table[]` 保存 Boot CPU 和各次级核的 affinity,便于逻辑核编号和 IPI 目标匹配。
+
+做到这里,QEMU virt64 AArch64 BSP 的多核启动主线基本就清楚了:Boot CPU 负责把内核和公共外设准备好,`main_thread_entry()` 发起 PSCI 唤核,次级核按同样的 MMU/EL 设置落地,再一起进入调度器。