Graduate flux from experiment folder to core torchtitan #1858
base: main
Conversation
Thanks! Left some comments.
tests/integration_tests/flux.py
Outdated
parser.add_argument(
    "--config_path",
-   default="./torchtitan/experiments/flux/train_configs/debug_model.toml",
+   default="./tests/integration_tests/base_config.toml",
With this I think you'll need to have the custom job config import in all FLUX OverrideDefinitions. BTW I plan to refactor this so that we don't need to always specify the custom import.
That's a nice catch! I double-checked our current design: we used base_config.toml to keep cross-model tests consistent, but the flux model has its own separate integration test file, so it's OK to directly reuse the debug model config instead of adding a bunch of parameters it doesn't contain (not only custom_job_config, but also inference, validation, etc.). WDYT?
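For context, a minimal runnable sketch of the `--config_path` flag under discussion, keeping the FLUX debug-model default per the comment above; the surrounding test harness and other flags are omitted, so treat this as illustrative only:

```python
import argparse

# Hypothetical standalone version of the FLUX integration-test CLI.
parser = argparse.ArgumentParser(description="FLUX integration tests")
parser.add_argument(
    "--config_path",
    # Reuse the FLUX debug model config directly, rather than the
    # cross-model tests/integration_tests/base_config.toml.
    default="./torchtitan/experiments/flux/train_configs/debug_model.toml",
)

# With no CLI override, the debug-model config is used.
args = parser.parse_args([])
print(args.config_path)
```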
)
self.hf_module = self.hf_module.eval().requires_grad_(False)
# This is to make sure the encoders work with FSDP
could you share more context around this change? I vaguely remember someone requested this in an issue but we couldn't reproduce.
This is a new change needed to run the FLUX model with HuggingFace encoders. In my recent tests, I found the HF encoder parameters are all strided (non-contiguous) tensors, which are not compatible with FSDP. So I manually convert the encoder's parameters to contiguous tensors.
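A minimal sketch of the kind of fix described above: walking a module's parameters and copying any non-contiguous one into contiguous storage before FSDP wrapping. The helper name `make_params_contiguous` is hypothetical, and a plain `nn.Linear` stands in for the HF encoder:

```python
import torch
import torch.nn as nn


def make_params_contiguous(module: nn.Module) -> nn.Module:
    """Replace any strided (non-contiguous) parameter storage with a
    contiguous copy, since FSDP's flat-parameter sharding expects
    contiguous tensors."""
    for name, param in module.named_parameters():
        if not param.data.is_contiguous():
            param.data = param.data.contiguous()
    return module


# Toy stand-in for an HF encoder: make its weight a transposed view,
# which is non-contiguous (strides reversed).
layer = nn.Linear(4, 4)
layer.weight.data = layer.weight.data.t()
assert not layer.weight.data.is_contiguous()

make_params_contiguous(layer)
assert all(p.data.is_contiguous() for p in layer.parameters())
```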
could you also rebase onto #1871
Sure, will rebase later
26ef236 to 3f607a9
Looks really good. Thanks for the refactor! Had some minor comments.
For more logical organization of files, do you think we can put the clip and t5 config.json into tests/assets/flux_test_encoders/?
need to rebase onto #1851
The FLUX.1 model is a diffusion model, which is different from language models, so it extends train.py where needed.
For train.py and run_train.sh, I kept a copy of run_tests() instead of generalizing integration_tests/run_test.py