-
Notifications
You must be signed in to change notification settings - Fork 65
Deprecating the requirements.txt for dependencies management #116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecating the requirements.txt for dependencies management #116
Conversation
fabianlim
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM just one small comment on the test workflow.
fabianlim
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
build/Dockerfile
Outdated
| RUN --mount=type=cache,target=/root/.cache/pip \ | ||
| python -m pip install --upgrade pip && \ | ||
| python -m pip install wheel && \ | ||
| python -m pip install "$(head bdist_name)[aim]" && \ | ||
| python -m pip install "$(head bdist_name)[flash-attn]" && \ | ||
| # Clean up the wheel module. It's only needed by flash-attn install | ||
| python -m pip uninstall wheel -y && \ | ||
| # Cleanup the bdist whl file | ||
| rm $(head bdist_name) /tmp/bdist_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anhuong Please review
I removed the [dev] optional dependencies.
The only package needed to install flash-attn is wheel.
The others packages in the [dev] are either installed(ie packaging, ninja) in build time or not needed(i.e sckit-learn).
This temp cache mount --mount=type=cache,target=/root/.cache/pip also reduced 3.1GB image size. 15.9G vs 12.8GB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes generally look good to me but when building the image i got error:
Dockerfile:21
--------------------
20 | python -m pip install build
21 | >>> RUN --mount=source=.git,target=.git,type=bind \
22 | >>> --mount=type=bind,readwrite,source=tuning,target=tuning \
23 | >>> --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
24 | >>> python -m build --wheel --outdir /tmp && \
25 | >>> ls /tmp/*.whl >/tmp/bdist_name
26 |
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref ucblo3xzuieskt2hd3dy0vnwp::5bhdpz76l1ss8oq6chind70tf: "/tuning": not found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not seen this before. Here is my output look like:
(fms-hf-tuning) root@cataract1:~/fms-hf-tuning# docker build -t fms-hf-tuning:dev . -f build/Dockerfile
[+] Building 212.9s (27/27) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 5.32kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for registry.access.redhat.com/ubi9/ubi:latest 0.4s
=> [internal] load metadata for registry.access.redhat.com/ubi9/python-311:latest 0.3s
=> [internal] load build context 0.2s
=> => transferring context: 10.28MB 0.2s
=> [release 1/18] FROM registry.access.redhat.com/ubi9/ubi@sha256:66233eebd72bb5baa25190d4f55e1dc3fff3a9b77186c1f91a0abdb274452072 0.0s
=> [build 1/3] FROM registry.access.redhat.com/ubi9/python-311@sha256:78aa5022cb9e28652b5923162fee2e6e6192df84166abb5c7438910f70f9d02e 0.0s
=> CACHED [build 2/3] RUN --mount=type=cache,target=/root/.cache/pip python -m pip install --upgrade pip && python -m pip install build 0.0s
=> [build 3/3] RUN --mount=source=.git,target=.git,type=bind --mount=type=bind,readwrite,source=tuning,target=tuning --mount=type=bind,source=pyproject.toml,target=pyproject.toml python -m build --whee 7.7s
=> CACHED [release 2/18] RUN dnf remove -y --disableplugin=subscription-manager subscription-manager python3.11-requests && dnf install -y make procps && dnf clean all 0.0s
=> CACHED [release 3/18] RUN dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo && dnf install -y cuda-cudart-11-8-11.8.89-1 0.0s
=> CACHED [release 4/18] RUN dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo && dnf install -y cuda-libraries-11-8-11.8.0-1 0.0s
=> CACHED [release 5/18] RUN dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo && dnf install -y cuda-command-line-tools-11-8-1 0.0s
=> CACHED [release 6/18] RUN dnf install -y python3.11 git && ln -s /usr/bin/python3.11 /bin/python && python -m ensurepip --upgrade && dnf clean all 0.0s
=> CACHED [release 7/18] WORKDIR /tmp 0.0s
=> [release 8/18] COPY --from=build /tmp/*.whl /tmp/bdist_name /tmp/ 0.1s
=> [release 9/18] RUN --mount=type=cache,target=/root/.cache/pip python -m pip install --upgrade pip && python -m pip install wheel && python -m pip install "$(head bdist_name)[aim]" && python - 166.5s
=> [release 10/18] RUN mkdir -p /licenses 0.5s
=> [release 11/18] COPY LICENSE /licenses/ 0.1s
=> [release 12/18] RUN mkdir /app 0.3s
=> [release 13/18] COPY build/launch_training.py build/accelerate_launch.py fixtures/accelerate_fsdp_defaults.yaml /app/ 0.1s
=> [release 14/18] COPY build/utils.py /app/build/ 0.1s
=> [release 15/18] RUN chmod +x /app/launch_training.py /app/accelerate_launch.py 0.3s
=> [release 16/18] RUN touch /.aim_profile && chmod -R 777 /.aim_profile && mkdir /.cache && chmod -R 777 /.cache 0.5s
=> [release 17/18] RUN useradd -u 1000 tuning -m -g 0 --system && chown -R tuning:0 /app /tmp && chmod -R g+rwX /app /tmp 0.6s
=> [release 18/18] WORKDIR /app 0.1s
=> exporting to image 34.3s
=> => exporting layers 34.3s
=> => writing image sha256:fae12f2d869b0eead3e2bfa05860e84317525404e643291c214c580c1740b37d 0.0s
=> => naming to docker.io/library/fms-hf-tuning:dev 0.0s
(fms-hf-tuning) root@cataract1:~/fms-hf-tuning#
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, the error line for me is --mount=type=bind,readwrite,source=tuning,target=tuning, will look into it.
Also my bad, I think the image should not be building the wheel but pulling the wheel from pypi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the error line for me is --mount=type=bind,readwrite,source=tuning,target=tuning
In stead of mounting could you try
COPY tuning tuning
Also my bad, I think the image should not be building the wheel but pulling the wheel from pypi
I think the idea of building whl in the image is to detect any changes in the repo which may break the image build early on. Similar to this
|
Hi @tedhtchang Built the image: Push the image to the local OpenShift image repo: Then I run ted's test script pointing to the local image registry And it looks good: |
Signed-off-by: ted chang <htchang@us.ibm.com>
anhuong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Ted! Note that we separated out the Dockerfile and build/release discussion into PR #126
…odel-stack#116) Signed-off-by: ted chang <htchang@us.ibm.com> Signed-off-by: aaron.chew1 <aaron.chew1@ibm.com>
Description of the change
Remove the python requirement.txt files and use only pyproject.toml to manage dependencies. We modify dependencies and optional-dependencies in same location.
Add build wheel and build image tests, etc
Related issue number
Closes #56
Closes #756
How to verify the PR
git clone --single-branch --branch deprecate-requirements https://github.com/tedhtchang/fms-hf-tuning.git cd fms-hf-tuningOr
3. Run e2e to test the SFTtrainer image. This command create kind cluster, install KFTO, Kueue, build whl and image, and create a pytorchjob workload using the built sft-trainer image.
Was the PR tested