
Commit 3b6ef61

Merge branch 'main' of academic:OpenMotionLab/MotionGPT
2 parents: 341f42d + e12e92b

File tree

1 file changed: +12 -6 lines changed

README.md

Lines changed: 12 additions & 6 deletions
@@ -258,7 +258,8 @@ optional parameters:
 
 <details>
 <summary>Instruction tuning and zero-shot learning.</summary>
-<img width="853" alt="figure12" src="./public/images/figure12.png">
+<img width="853" alt="figure12" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/4b5985b3-2a26-4b09-80a0-05a15343bf23">
+
 
 **Answer:** We propose instruction tuning to **train a single MotionGPT across all motion-related tasks**, while task-specific tuning trains and evaluates MotionGPTs on a single task. We employ these two training schemes to study the ability of MotionGPT across multiple tasks. As shown in this figure, we provide **zero-shot cases**. Benefitting from strong language models, MotionGPTs can understand words unseen in the text-to-motion training set, like "**scuttling**" and "**barriers**", and generate correct motions based on the meaning of the sentences. However, it still struggles to generate **unseen motions**, like gymnastics, even if MotionGPTs understand the text inputs.
 
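To make the distinction concrete, the sketch below shows how instruction tuning can phrase several motion-related tasks as natural-language prompts for a single model. The task names, template strings, and the `build_instruction` helper are illustrative assumptions for this example, not MotionGPT's actual instruction templates.

```python
# Illustrative sketch only: these task names and prompt strings are hypothetical,
# not MotionGPT's actual instruction templates.
import random

INSTRUCTION_TEMPLATES = {
    "text-to-motion": [
        "Generate a motion that matches the description: {caption}",
        "Show me how a person would {caption}",
    ],
    "motion-to-text": [
        "Describe the motion represented by {motion_tokens} in plain English.",
    ],
    "motion-prediction": [
        "Predict the next motion tokens following {motion_tokens}.",
    ],
}

def build_instruction(task: str, **fields) -> str:
    """Sample one template for the given task and fill in the caption/motion placeholders."""
    template = random.choice(INSTRUCTION_TEMPLATES[task])
    return template.format(**fields)

# A single model then sees many tasks phrased as natural-language instructions.
print(build_instruction("text-to-motion", caption="a person walks forward and jumps"))
```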

@@ -273,7 +274,9 @@ optional parameters:
 
 <details>
 <summary>How well does MotionGPT learn the relationship between motion and language?</summary>
-<img width="300" alt="figure10" src="./public/images/figure10.png"><img width="600" alt="figure12" src="./public/images/figure12.png">
+<img width="300" alt="figure10" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/a27abc97-ead2-4abd-a32c-e14049ba2421"><img width="600" alt="figure12" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/c82c1aee-c3e5-4090-8ddd-d0c78aae3330">
+
+
 
 **Answer:** **Unlike** previous motion generators that condition on the **text encoder of CLIP**, MotionGPTs leverage language models to learn the motion-language relationship rather than relying on text features from CLIP. According to our zero-shot results (cf. **Fig. 12**) and performance across multiple tasks (cf. **Fig. 10**), MotionGPTs establish robust connections between simple/complex texts and simple motions in evaluations, but they fall short when it comes to complex-text to **complex-motion translation**.
 

@@ -283,7 +286,9 @@ optional parameters:
 
 <details>
 <summary>Why choose T5, an encoder-decoder architecture, as the base model? How about a decoder-only model, like LLaMA?</summary>
-<img width="866" alt="table15" src="./public/images/table15.png">
+<img width="866" alt="table15" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/8f58ee1e-6a10-4b5c-9939-f79ba2ecccae">
+
+
 
 **Answer:** The **first language model that we used** to build MotionGPTs was **LLaMA-13B**. However, it showed insufficient performance and low training efficiency. We assume the reason is the limited dataset size compared to LLaMA's large parameter count and language data. We also tried a smaller decoder-only backbone, **GPT2-Medium**, and provide the results in **Tab. 15**. We thus chose **T5-770M**, a small but widely used language model, as our final backbone, because many previous vision-language multimodal works, like **Unified-IO** and **BLIP**, have adopted T5; this encoder-decoder architecture has shown strong ability to address multi-modal tasks. In addition, the main advantage of decoder-only models is self-supervised training without paired data; since we train on paired data, this advantage is greatly weakened. We are still working on collecting a large motion dataset for larger motion-language models.
 
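As a rough illustration of the encoder-decoder setup discussed above, the following sketch extends a T5 backbone with discrete motion tokens using Hugging Face `transformers`, so motion and text share one vocabulary. The `t5-base` checkpoint, the 512-entry codebook, and the `<motion_id_*>` token naming are assumptions for this example and may differ from the released code.

```python
# Minimal sketch (not the repository's actual code): extend a T5 backbone with
# discrete motion tokens so motion and text share one vocabulary.
# The checkpoint name and the codebook size of 512 are assumptions.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Register one placeholder token per VQ-VAE codebook entry, e.g. <motion_id_0> ... <motion_id_511>.
motion_tokens = [f"<motion_id_{i}>" for i in range(512)]
tokenizer.add_tokens(motion_tokens)
model.resize_token_embeddings(len(tokenizer))

# A text-to-motion sample then becomes an ordinary seq2seq pair:
inputs = tokenizer("Generate a motion: a person waves both hands.", return_tensors="pt")
labels = tokenizer("<motion_id_12><motion_id_305><motion_id_7>", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # standard encoder-decoder cross-entropy
```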

@@ -351,7 +356,7 @@ optional parameters:
 
 <details>
 <summary>Failure analysis. Zero-shot ability to handle words that have semantic meaning but could be unseen.</summary>
-<img width="853" alt="figure12" src="./public/images/figure12.png">
+<img width="853" alt="figure12" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/c82c1aee-c3e5-4090-8ddd-d0c78aae3330">
 
 **Answer:** As shown in **Fig. 12**, we provide both **zero-shot cases** and **failure cases**. Benefitting from strong language models, MotionGPTs can understand words unseen in the text-to-motion training set, like "**scuttling**" and "**barriers**", and generate correct motions based on the meaning of the sentences. However, it still struggles to generate unseen motions, like gymnastics, even if MotionGPTs understand the text inputs.
 

@@ -424,7 +429,7 @@ The real challenge lies in reconstructing complex motions, such as diving or gym
 
 <details>
 <summary>MotionGPT seems to sacrifice accuracy in exchange for additional functionalities.</summary>
-<img width="447" alt="figure10" src="./public/images/figure10.png">
+<img width="447" alt="figure10" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/a27abc97-ead2-4abd-a32c-e14049ba2421">
 
 **Answer:** As shown in **Fig. 10**, MotionGPT achieves SOTA on **18 out of 23** metrics across four motion-related tasks. Additionally, both HumanML3D and KIT are limited in overall dataset size, particularly when compared to billion-level language datasets, which affects the efficacy of large-scale models. We will further employ a larger motion-text dataset to evaluate MotionGPT. Besides, MotionGPT introduces motion-language pre-training and its zero-shot ability, which is a promising direction worth exploring and could stimulate self-training procedures for further research.
 

@@ -434,7 +439,8 @@ The real challenge lies in reconstructing complex motions, such as diving or gym
 
 <details>
 <summary>Visualize some of the tokens in the vocabulary that the VQ-VAE learned.</summary>
-<img width="857" alt="figure13" src="./public/images/figure13.png">
+<img width="857" alt="figure13" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/bf8ceacb-e857-477d-bfe7-a0763b42c508">
+
 
 **Answer:** As shown in **Fig. 13**, we visualize these **motion tokens** in the **motion vocabulary $V_m$** and their corresponding localized spatial-temporal contexts, depicted within **4-frame motion segments**. However, MotionGPT falls short in generating descriptions for each individual token, as the training is conducted on token sequences.
 
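The sketch below illustrates the idea behind this visualization: each motion token indexes an entry of the codebook $V_m$, which a VQ-VAE decoder maps back to a short (here 4-frame) motion segment. All sizes and module definitions are assumptions for illustration, not the actual MotionGPT architecture.

```python
# Rough sketch of the idea behind Fig. 13 (all sizes and module names are assumptions,
# not MotionGPT's actual implementation): each token id indexes a codebook vector,
# and a VQ-VAE decoder maps that vector back to a short motion segment.
import torch
import torch.nn as nn

NUM_CODES, CODE_DIM, FRAMES_PER_TOKEN, POSE_DIM = 512, 512, 4, 263  # assumed sizes

codebook = nn.Embedding(NUM_CODES, CODE_DIM)  # motion vocabulary V_m
decoder = nn.Sequential(                      # stand-in for the VQ-VAE decoder
    nn.Linear(CODE_DIM, 1024),
    nn.ReLU(),
    nn.Linear(1024, FRAMES_PER_TOKEN * POSE_DIM),
)

def visualize_token(token_id: int) -> torch.Tensor:
    """Decode a single motion token into its localized 4-frame motion segment."""
    z = codebook(torch.tensor([token_id]))              # (1, CODE_DIM)
    segment = decoder(z).view(FRAMES_PER_TOKEN, POSE_DIM)
    return segment                                      # one pose vector per frame

print(visualize_token(42).shape)  # torch.Size([4, 263])
```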
