Commit 30527a4

Update token visualization
1 parent 3b6ef61 commit 30527a4

5 files changed (+344 -1176 lines)

5 files changed

+344
-1176
lines changed

README.md

Lines changed: 6 additions & 6 deletions
````diff
@@ -260,7 +260,6 @@ optional parameters:
 <summary>Instruction tuning and zero-shot learning.</summary>
 <img width="853" alt="figure12" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/4b5985b3-2a26-4b09-80a0-05a15343bf23">
 
-
 **Answer:** We propose instruction tuning to **train a single MotionGPT across all motion-related tasks**, while task-specific tuning trains and evaluates MotionGPTs on a single task. We employ these two training schemes to study the ability of MotionGPT across multiple tasks. As shown in this figure, we provide **zero-shot cases**. Benefiting from strong language models, MotionGPTs can understand words unseen in the text-to-motion training set, like "**scuttling**" and "**barriers**", and generate correct motions based on the meaning of the sentences. However, MotionGPT still struggles to generate **unseen motions**, like gymnastics, even if it understands the text inputs.
 
 </details>
@@ -276,8 +275,6 @@ optional parameters:
 <summary>How well does MotionGPT learn the relationship between motion and language?</summary>
 <img width="300" alt="figure10" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/a27abc97-ead2-4abd-a32c-e14049ba2421"><img width="600" alt="figure12" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/c82c1aee-c3e5-4090-8ddd-d0c78aae3330">
 
-
-
 **Answer:** **Unlike** previous motion generators that use the **text encoder of CLIP** for conditioning, MotionGPTs leverage language models to learn the motion-language relationship instead of relying on text features from CLIP. According to our zero-shot results (cf. **Fig. 12**) and performance on multiple tasks (cf. **Fig. 10**), MotionGPTs establish robust connections between simple/complex texts and simple motions in evaluations, but they fall short when it comes to translating complex text into **complex motion**.
 
 </details>
@@ -288,8 +285,6 @@ optional parameters:
 <summary>Why choose T5, an encoder-decoder architecture, as the base model? How about a decoder-only model, like LLaMA?</summary>
 <img width="866" alt="table15" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/8f58ee1e-6a10-4b5c-9939-f79ba2ecccae">
 
-
-
 **Answer:** The **first language model we used** to build MotionGPTs was **LLaMA-13B**. However, it showed insufficient performance and low training efficiency; we assume the reason is the limited dataset size compared to LLaMA's large parameter count and language data. We also tried a smaller decoder-only backbone, **GPT2-Medium**, and provide the results in **Tab. 15**. We thus chose **T5-770M**, a small but common language model, as our final backbone, because many previous vision-language multimodal works, like **Unified-IO** and **BLIP**, have adopted this encoder-decoder architecture and shown it to be strong on multi-modal tasks. In addition, the main advantage of decoder-only models is self-supervised training without paired data; since we have paired data, this advantage is greatly weakened. We are still working on collecting a large motion dataset for larger motion-language models.
 
 </details>
@@ -441,9 +436,14 @@ The real challenge lies in reconstructing complex motions, such as diving or gymnastics
 <summary>Visualize some of the tokens in the vocabulary that VQ-VAE learned.</summary>
 <img width="857" alt="figure13" src="https://github.com/OpenMotionLab/MotionGPT/assets/120085716/bf8ceacb-e857-477d-bfe7-a0763b42c508">
 
-
 **Answer:** As shown in **Fig. 13**, we visualize these **motion tokens** in the **motion vocabulary $V_m$** and their corresponding localized spatial-temporal contexts, depicted as **4-frame motion segments**. However, MotionGPT falls short in generating descriptions for each individual token, as training is conducted on token sequences.
 
+You can run the script below to visualize more tokens:
+
+```
+python -m scripts.get_code_visual --cfg configs/config_h3d_stage2.yaml
+```
+
 </details>
 </details>
 
````
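For context, the `scripts.get_code_visual` entry point added above is the supported way to render the learned motion tokens. A rough sketch of the general idea, decoding each codebook entry back into a short pose segment, might look like the following; the `codebook` attribute, `decode` method, and function name are assumptions for illustration, not the repository's actual API:

```python
import os

import torch


@torch.no_grad()
def visualize_motion_tokens(vqvae, feats2joints, out_dir="token_vis"):
    """Decode every entry of a motion VQ-VAE codebook into joint positions.

    A rough sketch only: `vqvae` is assumed to expose a `codebook` tensor of
    shape (num_tokens, code_dim) and a `decode` method mapping code vectors
    back to motion features; `feats2joints` is the data module's
    feature-to-joints conversion. These names are illustrative placeholders.
    """
    os.makedirs(out_dir, exist_ok=True)
    for idx, code in enumerate(vqvae.codebook):
        # Each token corresponds to a short (roughly 4-frame) motion segment
        # once the decoder upsamples it back to feature space.
        feats = vqvae.decode(code[None, None])   # (1, n_frames, n_feats)
        joints = feats2joints(feats)             # recover 3D joint positions
        torch.save(joints.cpu(), os.path.join(out_dir, f"token_{idx:04d}.pt"))
```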

mGPT/data/HumanML3D.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -1,5 +1,6 @@
 import numpy as np
 import torch
+import os
 from os.path import join as pjoin
 from .humanml.utils.word_vectorizer import WordVectorizer
 from .humanml.scripts.motion_process import (process_file, recover_from_ric)
@@ -85,7 +86,10 @@ def feats2joints(self, features):
         return recover_from_ric(features, self.njoints)
 
     def joints2feats(self, features):
-        features = process_file(features, self.njoints)[0]
+        example_data = np.load(os.path.join(self.hparams.data_root, 'joints', '000021.npy'))
+        example_data = example_data.reshape(len(example_data), -1, 3)
+        example_data = torch.from_numpy(example_data)
+        features = process_file(features, self.njoints, example_data, 't2m')[0]
         return features
 
     def normalize(self, features):
```
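The updated `joints2feats` loads a fixed reference clip from `data_root/joints`, reshapes it to `(n_frames, n_joints, 3)`, and hands it to `process_file` together with a dataset tag (`'t2m'` here, `'kit'` in the KIT module below). A minimal usage sketch under those assumptions; the helper name, `datamodule` argument, and joints path are illustrative placeholders, not part of this commit:

```python
import numpy as np


def joints_file_to_feats(datamodule, joints_path):
    """Convert a raw joints .npy file into HumanML3D motion features.

    A rough sketch only: `datamodule` stands for an already-configured mGPT
    HumanML3D data module (so joints2feats can find joints/000021.npy under
    its data_root); the function name and arguments are illustrative.
    """
    joints = np.load(joints_path)                # assumed layout: (n_frames, n_joints, 3) or flattened
    joints = joints.reshape(len(joints), -1, 3)  # same reshape applied to the reference clip above
    return datamodule.joints2feats(joints)       # calls process_file(..., example_data, 't2m') internally
```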

mGPT/data/Kit.py

Lines changed: 5 additions & 10 deletions
```diff
@@ -1,5 +1,6 @@
 import numpy as np
 import torch
+import os
 from os.path import join as pjoin
 from .humanml.utils.word_vectorizer import WordVectorizer
 from .humanml.scripts.motion_process import (process_file, recover_from_ric)
@@ -45,17 +46,11 @@ def __init__(self, cfg, **kwargs):
         self.nfeats = self._sample_set.nfeats
         cfg.DATASET.NFEATS = self.nfeats
 
-    def feats2joints(self, features):
-        mean = torch.tensor(self.hparams.mean).to(features)
-        std = torch.tensor(self.hparams.std).to(features)
-        features = features * std + mean
-        return recover_from_ric(features, self.njoints)
-
     def joints2feats(self, features):
-        features = process_file(features, self.njoints)[0]
-        # mean = torch.tensor(self.hparams.mean).to(features)
-        # std = torch.tensor(self.hparams.std).to(features)
-        # features = (features - mean) / std
+        example_data = np.load(os.path.join(self.hparams.data_root, 'joints', '03950_gt.npy'))
+        example_data = example_data.reshape(len(example_data), -1, 3)
+        example_data = torch.from_numpy(example_data)
+        features = process_file(features, self.njoints, example_data, 'kit')[0]
         return features
 
     def normalize(self, features):
```
