Skip to content

Conversation

@tedhtchang
Copy link
Collaborator

@tedhtchang tedhtchang commented Apr 11, 2024

Description of the change

Remove the python requirement.txt files and use only pyproject.toml to manage dependencies. We modify dependencies and optional-dependencies in same location.
Add build wheel and build image tests, etc

Related issue number

Closes #56
Closes #756

How to verify the PR

  1. Checkout branch
git clone --single-branch --branch deprecate-requirements https://github.com/tedhtchang/fms-hf-tuning.git
cd fms-hf-tuning
  1. Build whl and run tests. tox reads the pyprjoect.toml and create isolated venv and run test.
tox -e py
# build bdist whl in dist/
tox -e build
# see files in the whl
unzip -l dist/*.whl

Or
3. Run e2e to test the SFTtrainer image. This command create kind cluster, install KFTO, Kueue, build whl and image, and create a pytorchjob workload using the built sft-trainer image.

curl https://gist.githubusercontent.com/tedhtchang/34335240230d64c6e2ecb5b78d8a8cb0/raw/6c568c7fdc82ed9e344474da5278ec43a6fc788b/kind-e2e-sft-mini.sh | bash

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@tedhtchang tedhtchang changed the title Deprecating the requirements files Deprecating the requirements files for dependencies management Apr 11, 2024
@tedhtchang tedhtchang changed the title Deprecating the requirements files for dependencies management Deprecating the requirements.txt for dependencies management Apr 11, 2024
@Ssukriti Ssukriti requested review from fabianlim and kmehant and removed request for anhuong April 11, 2024 23:52
fabianlim
fabianlim previously approved these changes Apr 12, 2024
Copy link
Collaborator

@fabianlim fabianlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just one small comment on the test workflow.

@tedhtchang
Copy link
Collaborator Author

@gkumbhat Could you review.
@anhuong I changed the Dockerfile in this PR could you take a look if it will break any thing.

@tedhtchang tedhtchang requested review from anhuong and fabianlim April 13, 2024 11:39
@fabianlim fabianlim self-requested a review April 14, 2024 13:10
fabianlim
fabianlim previously approved these changes Apr 14, 2024
Copy link
Collaborator

@fabianlim fabianlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

build/Dockerfile Outdated
Comment on lines 113 to 121
RUN --mount=type=cache,target=/root/.cache/pip \
python -m pip install --upgrade pip && \
python -m pip install wheel && \
python -m pip install "$(head bdist_name)[aim]" && \
python -m pip install "$(head bdist_name)[flash-attn]" && \
# Clean up the wheel module. It's only needed by flash-attn install
python -m pip uninstall wheel -y && \
# Cleanup the bdist whl file
rm $(head bdist_name) /tmp/bdist_name
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anhuong Please review
I removed the [dev] optional dependencies.
The only package needed to install flash-attn is wheel.
The others packages in the [dev] are either installed(ie packaging, ninja) in build time or not needed(i.e sckit-learn).
This temp cache mount --mount=type=cache,target=/root/.cache/pip also reduced 3.1GB image size. 15.9G vs 12.8GB

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes generally look good to me but when building the image i got error:

Dockerfile:21

--------------------

  20 |         python -m pip install build

  21 | >>> RUN --mount=source=.git,target=.git,type=bind \

  22 | >>>     --mount=type=bind,readwrite,source=tuning,target=tuning \

  23 | >>>     --mount=type=bind,source=pyproject.toml,target=pyproject.toml \

  24 | >>>     python -m build --wheel --outdir /tmp && \

  25 | >>>     ls /tmp/*.whl >/tmp/bdist_name

  26 |     

--------------------

ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref ucblo3xzuieskt2hd3dy0vnwp::5bhdpz76l1ss8oq6chind70tf: "/tuning": not found

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not seen this before. Here is my output look like:

(fms-hf-tuning) root@cataract1:~/fms-hf-tuning# docker build -t fms-hf-tuning:dev . -f build/Dockerfile
[+] Building 212.9s (27/27) FINISHED                                                                                                                                                                          docker:default
 => [internal] load build definition from Dockerfile                                                                                                                                                                    0.0s
 => => transferring dockerfile: 5.32kB                                                                                                                                                                                  0.0s
 => [internal] load .dockerignore                                                                                                                                                                                       0.0s
 => => transferring context: 2B                                                                                                                                                                                         0.0s
 => [internal] load metadata for registry.access.redhat.com/ubi9/ubi:latest                                                                                                                                             0.4s
 => [internal] load metadata for registry.access.redhat.com/ubi9/python-311:latest                                                                                                                                      0.3s
 => [internal] load build context                                                                                                                                                                                       0.2s
 => => transferring context: 10.28MB                                                                                                                                                                                    0.2s
 => [release  1/18] FROM registry.access.redhat.com/ubi9/ubi@sha256:66233eebd72bb5baa25190d4f55e1dc3fff3a9b77186c1f91a0abdb274452072                                                                                    0.0s
 => [build 1/3] FROM registry.access.redhat.com/ubi9/python-311@sha256:78aa5022cb9e28652b5923162fee2e6e6192df84166abb5c7438910f70f9d02e                                                                                 0.0s
 => CACHED [build 2/3] RUN --mount=type=cache,target=/root/.cache/pip     python -m pip install --upgrade pip &&     python -m pip install build                                                                        0.0s
 => [build 3/3] RUN --mount=source=.git,target=.git,type=bind     --mount=type=bind,readwrite,source=tuning,target=tuning     --mount=type=bind,source=pyproject.toml,target=pyproject.toml     python -m build --whee  7.7s
 => CACHED [release  2/18] RUN dnf remove -y --disableplugin=subscription-manager         subscription-manager         python3.11-requests     && dnf install -y make         procps     && dnf clean all               0.0s
 => CACHED [release  3/18] RUN dnf config-manager        --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo     && dnf install -y         cuda-cudart-11-8-11.8.89-1      0.0s
 => CACHED [release  4/18] RUN dnf config-manager        --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo     && dnf install -y         cuda-libraries-11-8-11.8.0-1    0.0s
 => CACHED [release  5/18] RUN dnf config-manager        --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo     && dnf install -y         cuda-command-line-tools-11-8-1  0.0s
 => CACHED [release  6/18] RUN dnf install -y python3.11 git &&     ln -s /usr/bin/python3.11 /bin/python &&     python -m ensurepip --upgrade &&     dnf clean all                                                     0.0s
 => CACHED [release  7/18] WORKDIR /tmp                                                                                                                                                                                 0.0s
 => [release  8/18] COPY --from=build /tmp/*.whl /tmp/bdist_name /tmp/                                                                                                                                                  0.1s
 => [release  9/18] RUN --mount=type=cache,target=/root/.cache/pip     python -m pip install --upgrade pip &&     python -m pip install wheel &&     python -m pip install "$(head bdist_name)[aim]" &&     python -  166.5s
 => [release 10/18] RUN mkdir -p /licenses                                                                                                                                                                              0.5s
 => [release 11/18] COPY LICENSE /licenses/                                                                                                                                                                             0.1s
 => [release 12/18] RUN mkdir /app                                                                                                                                                                                      0.3s
 => [release 13/18] COPY build/launch_training.py build/accelerate_launch.py fixtures/accelerate_fsdp_defaults.yaml /app/                                                                                               0.1s
 => [release 14/18] COPY build/utils.py /app/build/                                                                                                                                                                     0.1s
 => [release 15/18] RUN chmod +x /app/launch_training.py /app/accelerate_launch.py                                                                                                                                      0.3s
 => [release 16/18] RUN touch /.aim_profile &&     chmod -R 777 /.aim_profile &&     mkdir /.cache &&     chmod -R 777 /.cache                                                                                          0.5s
 => [release 17/18] RUN useradd -u 1000 tuning -m -g 0 --system &&     chown -R tuning:0 /app /tmp &&     chmod -R g+rwX /app /tmp                                                                                      0.6s
 => [release 18/18] WORKDIR /app                                                                                                                                                                                        0.1s
 => exporting to image                                                                                                                                                                                                 34.3s
 => => exporting layers                                                                                                                                                                                                34.3s
 => => writing image sha256:fae12f2d869b0eead3e2bfa05860e84317525404e643291c214c580c1740b37d                                                                                                                            0.0s
 => => naming to docker.io/library/fms-hf-tuning:dev                                                                                                                                                                    0.0s
(fms-hf-tuning) root@cataract1:~/fms-hf-tuning#

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, the error line for me is --mount=type=bind,readwrite,source=tuning,target=tuning, will look into it.

Also my bad, I think the image should not be building the wheel but pulling the wheel from pypi

Copy link
Collaborator Author

@tedhtchang tedhtchang Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error line for me is --mount=type=bind,readwrite,source=tuning,target=tuning

In stead of mounting could you try

COPY tuning tuning

Also my bad, I think the image should not be building the wheel but pulling the wheel from pypi

I think the idea of building whl in the image is to detect any changes in the repo which may break the image build early on. Similar to this

@jbusche
Copy link
Collaborator

jbusche commented Apr 16, 2024

Hi @tedhtchang
I tried this and it worked on my OpenShift 4.14.17 cluster:

Built the image:

git clone --single-branch --branch deprecate-requirements https://github.com/tedhtchang/fms-hf-tuning.git
cd fms-hf-tuning
docker build -t fms-hf-tuning:dev . -f build/Dockerfile

Push the image to the local OpenShift image repo:

podman login -u kubeadmin -p $(oc whoami -t) $(oc registry info) --tls-verify=false
        
podman tag localhost/fms-hf-tuning:dev $(oc registry info)/opendatahub/fms-hf-tuning:dev
podman push  --tls-verify=false $(oc registry info)/opendatahub/fms-hf-tuning:dev

Then I run ted's test script pointing to the local image registry image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:dev

And it looks good:

oc get pytorchjobs
NAME           STATE       AGE
ted-kfto-sft   Succeeded   18m

oc get pods
NAME                                        READY   STATUS      RESTARTS   AGE
ted-kfto-sft-master-0                       0/1     Completed   0          16m

Signed-off-by: ted chang <htchang@us.ibm.com>
Copy link
Collaborator

@anhuong anhuong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Ted! Note that we separated out the Dockerfile and build/release discussion into PR #126

@anhuong anhuong merged commit db99b28 into foundation-model-stack:main Apr 17, 2024
achew010 pushed a commit to achew010/fms-hf-tuning that referenced this pull request May 6, 2024
…odel-stack#116)

Signed-off-by: ted chang <htchang@us.ibm.com>
Signed-off-by: aaron.chew1 <aaron.chew1@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python package dependency management for building/development

4 participants