Minimal implementation of muP scaling for Llama #304
Implements muP scaling for Llama models. The model follows muP scaling laws but introduces the minimal set of extra tunable hyperparameters that allows us to recover prior behavior - thus it may not be compatible (yet) with existing muP configs. See here for training script changes.
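For context, here is a minimal sketch of the kind of width-dependent multipliers muP introduces. This is not the PR's actual code: `mup_base_width`, `emb_mult`, and the class/method names are illustrative assumptions, standing in for whatever tunable hyperparameters the PR exposes.

```python
import torch
import torch.nn as nn

class MupTinyLM(nn.Module):
    """Toy LM illustrating the usual muP multipliers (illustrative only)."""

    def __init__(self, vocab: int, width: int, n_heads: int,
                 mup_base_width: int = 256, emb_mult: float = 1.0):
        super().__init__()
        self.emb = nn.Embedding(vocab, width)
        self.head = nn.Linear(width, vocab, bias=False)
        self.head_dim = width // n_heads
        self.emb_mult = emb_mult                 # tunable input multiplier
        self.out_mult = mup_base_width / width   # readout shrinks as 1/width

    def embed(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.emb(tokens) * self.emb_mult

    def attn_scores(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # muP scales attention logits by 1/d_head rather than 1/sqrt(d_head)
        # so the logit scale stays O(1) as heads widen.
        return (q @ k.transpose(-2, -1)) / self.head_dim

    def logits(self, h: torch.Tensor) -> torch.Tensor:
        # At width == mup_base_width, out_mult == 1 and this reduces to a
        # standard LM head; this is the sense in which default values of the
        # extra hyperparameters can recover the prior (non-muP) behavior.
        return self.head(h) * self.out_mult
```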
- Swaps the `trunc_normal_` init for `normal_` (`trunc_normal_` is redundant without specifying the clamp values, plus we noticed some FSDP-related issues in the speculator setting).
- Reworks `reset_parameters` so that each module initializes itself but no submodules, following the FSDP init_fn contract (see the sketch after this note).

Note that this is currently only implemented for Llama models, and does not support the old constant-range Llama init scheme. Additional work will be required to make these compatible; should we decide to support muP, this is just a starting-off / reference point.
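A minimal sketch of that `reset_parameters` contract, assuming a hypothetical `MupLinear` and `init_model` helper rather than the PR's actual modules:

```python
import torch.nn as nn

class MupLinear(nn.Linear):
    def reset_parameters(self) -> None:
        # Initialize only this module's own parameters: plain normal_
        # (trunc_normal_ without explicit clamp values adds nothing here),
        # with a muP-style 1/sqrt(fan_in) std.
        nn.init.normal_(self.weight, mean=0.0, std=self.in_features ** -0.5)
        if self.bias is not None:
            nn.init.zeros_(self.bias)

def init_model(model: nn.Module) -> None:
    # FSDP-style init: visit every module once and let each one initialize
    # itself but no submodules, so no parameter is ever initialized twice.
    for module in model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()
```

Because each `reset_parameters` touches only the module's own parameters, a per-module traversal like the one above (which is how FSDP's meta-device init applies a `param_init_fn`) initializes every parameter exactly once.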