The dataset sizes reported in Section 3 and Section 4 do not appear to be consistent. For example:
In Section 4, training stage 1 (General Capability Training) uses 4.8M data samples, but Section 3 only describes 1.6M general image-text samples. It’s unclear which specific datasets from Section 3 account for the remaining 3.2M or so.
Stages 2 (Embodied Spatial-Temporal Training) and 3 (CoT Reasoning) together use about 500K data samples, whereas the "Share Robot" data for temporal planning described in Section 3 already amounts to 1M on its own.