**Answer:** We propose instruction tuning to **train a single MotionGPT across all motion-related tasks**, while task-specific tuning trains and evaluates MotionGPTs on a single task. We employ these two training schemes to study the ability of MotionGPT across multiple tasks. As shown in the figure, we provide **zero-shot cases**. Benefiting from strong language models, MotionGPTs can understand words unseen in the text-to-motion training set, like "**scuttling**" and "**barriers**", and generate correct motions based on the meaning of the sentences. However, they still struggle to generate **unseen motions**, like gymnastics, even though MotionGPTs understand the text inputs.
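To make the two training schemes concrete, here is a minimal sketch (plain Python, not the official MotionGPT code) of how instruction tuning frames several motion tasks as text-to-text pairs while task-specific tuning keeps only one of them; the prompt wordings and the `<motion_i>` placeholders are illustrative assumptions.

```python
# Hypothetical prompt templates, one per task type (illustrative only).
TEMPLATES = {
    "t2m": "Generate a motion matching the following description: {text}",
    "m2t": "Describe the following motion: {motion}",
    "pred": "Predict the future frames of the following motion: {motion}",
}

def build_instruction_samples(records, task=None):
    """Turn raw (text, motion-token) records into instruction-style pairs.

    records: list of dicts with keys "task", "text", "motion".
    task: if set, keep only that task (task-specific tuning);
          if None, keep all tasks (instruction tuning of a single model).
    """
    samples = []
    for r in records:
        if task is not None and r["task"] != task:
            continue
        prompt = TEMPLATES[r["task"]].format(**r)
        target = r["motion"] if r["task"] in ("t2m", "pred") else r["text"]
        samples.append({"input": prompt, "target": target})
    return samples

# The same corpus can feed both schemes.
records = [
    {"task": "t2m", "text": "a person scuttles sideways past barriers",
     "motion": "<motion_12> <motion_87> <motion_3>"},
    {"task": "m2t", "text": "a person walks forward slowly",
     "motion": "<motion_5> <motion_5> <motion_41>"},
]
all_tasks = build_instruction_samples(records)          # instruction tuning
t2m_only  = build_instruction_samples(records, "t2m")   # task-specific tuning
```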
</details>
<details>
<summary>How well does MotionGPT learn the relationship between motion and language?</summary>
**Answer:** **Unlike** previous motion generators that use the **text encoder of CLIP** for conditioning, MotionGPTs leverage language models to learn the motion-language relationship instead of relying on text features from CLIP. According to our zero-shot results (cf. **Fig. 12**) and performance across multiple tasks (cf. **Fig. 10**), MotionGPTs establish robust connections between simple/complex texts and simple motions in evaluations, but they fall short when it comes to translating complex texts into **complex motions**.
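To illustrate the difference in conditioning, a small sketch using Hugging Face Transformers (publicly available checkpoints; not MotionGPT's training code) contrasts the single pooled CLIP text embedding that a conventional generator conditions on with the token sequence a language model reads directly:

```python
from transformers import CLIPTokenizer, CLIPTextModelWithProjection, T5Tokenizer

text = "a person jumps over a barrier"

# (a) CLIP-style conditioning: the sentence is collapsed into one fixed
#     embedding vector used as the condition for a motion generator.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_txt = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
clip_feat = clip_txt(**clip_tok(text, return_tensors="pt")).text_embeds  # shape (1, 512)

# (b) Language-model-style conditioning: the sentence stays a token sequence
#     that the seq2seq language model processes end to end.
t5_tok = T5Tokenizer.from_pretrained("t5-large")
t5_ids = t5_tok(text, return_tensors="pt").input_ids  # shape (1, seq_len)
```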
282
279
283
280
</details>
<details>
<summary>Why choose T5, an encoder-decoder architecture, as the base model? How about a decoder-only model, like LLaMA?</summary>
**Answer:** The **first language model we used** to build MotionGPTs was **LLaMA-13B**. However, it showed insufficient performance and low training efficiency, which we attribute to the limited size of our dataset compared to the large parameter count and language data of LLaMA. We also tried a smaller decoder-only backbone, **GPT2-Medium**, and provide the results in **Tab. 15**. We thus chose **T5-770M**, a small but widely used language model, as our final backbone, because many previous vision-language multimodal works, like **Unified-IO** and **BLIP**, have adopted this encoder-decoder architecture, and it has shown strong ability on multimodal tasks. In addition, the main advantage of decoder-only models is self-supervised training without paired data; since we train on paired data, this advantage is greatly weakened. We are still working on collecting a large motion dataset for larger motion-language models.
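As a minimal sketch of this backbone choice (using Hugging Face Transformers; the number of motion tokens and the `<motion_i>` naming are assumptions for illustration), T5-770M corresponds to the `t5-large` checkpoint, and its vocabulary can be extended with motion tokens so that text and motion share one token space:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")  # ~770M parameters

# Add one text token per entry of the learned motion codebook (512 is assumed).
motion_tokens = [f"<motion_{i}>" for i in range(512)]
num_added = tokenizer.add_tokens(motion_tokens)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

print(f"added {num_added} motion tokens; vocab size is now {len(tokenizer)}")
```

Resizing the embedding matrix is what lets the same seq2seq model read and emit motion tokens alongside ordinary words.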
294
289
295
290
</details>
<details>
<summary>Visualize some of the tokens in the vocabulary that the VQ-VAE learned.</summary>
**Answer:** As shown in **Fig. 13**, we visualize these **motion tokens** in the **motion vocabulary $V_m$** together with their corresponding localized spatial-temporal contexts, depicted as **4-frame motion segments**. However, MotionGPT falls short in generating a description for each individual token, as training is conducted on token sequences.
You can run the script below to visualize more tokens:
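The following is a minimal sketch of such a visualization script, decoding each codebook entry on its own into a short motion segment. The `MotionVQVAE` class, its `decode()` signature, the `codebook_size` attribute, and the checkpoint path are hypothetical placeholders, not the repository's actual interface:

```python
import os
import torch

# Hypothetical VQ-VAE wrapper standing in for the real model in the repo.
vqvae = MotionVQVAE.load_from_checkpoint("checkpoints/motion_vqvae.ckpt")
vqvae.eval()

os.makedirs("token_vis", exist_ok=True)
num_tokens = vqvae.codebook_size  # size of the motion vocabulary V_m

with torch.no_grad():
    for token_id in range(num_tokens):
        # Decode a single-token sequence into a short (e.g. 4-frame) segment.
        token_seq = torch.tensor([[token_id]])
        motion_segment = vqvae.decode(token_seq)  # e.g. (1, frames, joints, 3)
        torch.save(motion_segment, f"token_vis/token_{token_id:03d}.pt")
```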
</details>