feat: Enable LoRA saving only for non MoE linear layers training with kernels. #530
Conversation
Thanks for making a pull request! 😃
Force-pushed from 6b70cc8 to da81f93.
    model = self.trainer.model
    if hasattr(model, "module"):
        model = model.module
Is this needed for the non-LoRA use case? After this step it would directly reach the block below:

    else:
        model.config.save_pretrained(hf_converted_output_dir)
It's needed to determine whether there is a PEFT config for the following if statement, but in the fine-tuning case that if statement should be false.
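A minimal sketch of the pattern under discussion, assuming the DDP-style `module` unwrap and a `peft_config` attribute check; names such as `hf_converted_output_dir` come from the snippet above, and this is not the exact merged code:

```python
# Sketch: unwrap a distributed wrapper, then branch on whether a PEFT (LoRA)
# config is attached to the model before deciding what to save.
model = self.trainer.model
if hasattr(model, "module"):  # e.g. DistributedDataParallel wraps the real model
    model = model.module

if getattr(model, "peft_config", None) is not None:
    # LoRA case: save the adapter weights and adapter config
    model.save_pretrained(hf_converted_output_dir)
else:
    # Full fine-tuning case: save the base model config as before
    model.config.save_pretrained(hf_converted_output_dir)
```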
@willmj thanks for diligently taking my inputs. After addressing my comments, please run all tests related to the MoE kernels on your local setup and paste screenshots here for the record. Thanks, appreciate this PR!
        model, train_args, modifiable_args=(peft_config,)
    )
    # For LoRA ScatterMoE, if expert layers are included, disable grad
    if peft_config is not None:
I think it's better to just do an instance check using ScatterMoE and then freeze everything inside.
Maybe you should add a comment that this is a workaround, and that in the future, once ScatterMoE LoRAs can be tuned, this code needs to be removed.
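A hedged sketch of the freezing workaround discussed above; the `freeze_scattermoe_params` helper name is hypothetical, and the ScatterMoE class is passed in as an argument rather than imported, since the real import path isn't shown here:

```python
import torch.nn as nn

def freeze_scattermoe_params(model: nn.Module, scatter_moe_cls: type) -> None:
    """Workaround: LoRA tuning of ScatterMoE expert/router layers is not yet
    supported, so freeze every parameter inside ScatterMoE modules.
    Remove this once ScatterMoE LoRA tuning is supported."""
    for module in model.modules():
        if isinstance(module, scatter_moe_cls):
            for param in module.parameters():
                param.requires_grad = False
```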
fabianlim left a comment:
Since @kmehant is not around, I approve it. TBH I'm not that familiar with the recent updates in this repo, but overall it looks OK. Just one more comment.
feat: Enable LoRA saving only for non MoE linear layers training with kernels. (#530)

* save peft
* post process hf converted dir
* fix: convert hf converted checkpoint
* lora config
* save adapter config
* fix: add input linear and output linear to target modules
* fix: extend instead of append
* fix: if hasattr peft config
* fix: remove unneeded target modules
* test: lora for scattermoe
* explicitly don't support router layer
* docs: update documentation
* fix: simplify accelerate launch post processing
* tests: more target modules + ep_degree
* fix: only restrict all-linear, raise warning for other modules
* fix: augmentation test
* fix: raise error
* turn off requires grad if using scattermoe with lora
* fix: freeze scattermoe params
* fix: safer freezing

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Description of the change
A more limited version of #523: it enables LoRA training with ScatterMoE kernels but explicitly blocks the router linear and expert layers from being targeted.
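A rough sketch of that restriction; the module names in `BLOCKED_MODULES` and the helper name are illustrative assumptions, not the PR's exact implementation:

```python
import warnings

# Assumed ScatterMoE layer names that LoRA must not target.
BLOCKED_MODULES = {"router", "input_linear", "output_linear"}

def validate_lora_target_modules(target_modules):
    # "all-linear" would sweep the router and expert linears into LoRA, so reject it.
    if target_modules == "all-linear":
        raise ValueError(
            "target_modules='all-linear' is not supported with ScatterMoE kernels; "
            "list the supported linear layers explicitly instead."
        )
    # Explicitly listed ScatterMoE layers are skipped with a warning.
    blocked = BLOCKED_MODULES.intersection(target_modules)
    if blocked:
        warnings.warn(
            f"LoRA is not applied to ScatterMoE layers {sorted(blocked)}; "
            "they will be ignored."
        )
```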
Related issue number
How to verify the PR
Was the PR tested