Conversation

@c-salomonsen
Contributor

@sot176 I had to change a few things in your dataset to make it fit the format, but I'm not sure my changes are what we'll actually want in the end, which raises two questions: how do we point `data_path` to our dataset, and do we always assume the datasets are stored as files rather than directories?

@c-salomonsen c-salomonsen added the enhancement New feature or request label Feb 6, 2025
@c-salomonsen c-salomonsen self-assigned this Feb 6, 2025
@c-salomonsen
Contributor Author

I'm also not sure how we ought to structure our targets. Can we always expect them to be an int, or should the dataloader one-hot encode them right away, and thus expect an array or tensor?
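For reference, the two candidate target formats differ only by a trivial conversion; a pure-Python sketch of one-hot encoding an integer class index (names hypothetical):

```python
def one_hot(label, num_classes):
    """Convert an integer class index into a one-hot list."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

assert one_hot(2, 4) == [0, 0, 1, 0]
```

So whichever format the dataloader standardises on, converting at the boundary is cheap.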

@c-salomonsen c-salomonsen linked an issue Feb 6, 2025 that may be closed by this pull request
Contributor

@sot176 sot176 left a comment


Since our dataset is a single HDF5 file (at least in the Kaggle version), I think the best solution is to ensure that data_path is a file path rather than a directory.

I usually pass data_path dynamically to the dataset loader via argparse and specify the path in my shell script.

Regarding the targets, I think we can assume they are always integers. However, it might be beneficial to ensure that all datasets follow the same structure for labels and decide on a standard format—whether to keep them as integers or convert them to one-hot encoding.
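A minimal sketch of the argparse pattern described above, assuming a hypothetical `--data-path` flag and file name (the real names are up to the project):

```python
import argparse
import pathlib

# Hypothetical CLI: a shell script would invoke e.g.
#   python train.py --data-path data/train.h5
parser = argparse.ArgumentParser()
parser.add_argument("--data-path", type=pathlib.Path, required=True,
                    help="path to the HDF5 dataset file")

# Simulated invocation for illustration:
args = parser.parse_args(["--data-path", "data/train.h5"])
print(args.data_path.name)
```

Using `type=pathlib.Path` means the rest of the code receives a `Path` object rather than a raw string.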

@c-salomonsen
Contributor Author

> Since our dataset is a single HDF5 file (at least in the Kaggle version), I think the best solution is to ensure that data_path is a file path rather than a directory.

Yes, I agree. I just didn't know if that was the case for the others as well; do you know?

> Regarding the targets, I think we can assume they are always integers. However, it might be beneficial to ensure that all datasets follow the same structure for labels and decide on a standard format—whether to keep them as integers or convert them to one-hot encoding.

Yep, I'd kind of assumed that at least for the loss function (cross-entropy) we had to one-hot encode the targets, but I don't know.

@sot176
Contributor

sot176 commented Feb 7, 2025

PyTorch's CrossEntropyLoss expects the target labels to be integers representing class indices (not one-hot encoded) :)
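A tiny dependency-free sketch of why integer class indices suffice for cross-entropy (this mirrors what PyTorch does internally, but is not PyTorch code): the target index simply selects one log-probability per sample.

```python
import math

def cross_entropy(logits, target_idx):
    """Cross-entropy of one sample: -log(softmax(logits)[target_idx]).
    The integer target only indexes into the logits, so no one-hot
    encoding is needed."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_idx]

loss = cross_entropy([2.0, 0.5, -1.0], target_idx=0)
```

Picking a less likely class as the target gives a larger loss, as expected.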

@sot176
Contributor

sot176 commented Feb 7, 2025

Regarding the file path: I unfortunately don't know. Maybe we open an issue to discuss that and decide on one way?

@hzavadil98
Contributor

Hi, in #53 I threw a hammer into the datasets a little bit - the parameters passed through load_data are a bit different, but it shouldn't affect the test for it, I believe. Regarding the data_path argument: I store the MNIST files in the ./Data/MNIST folder and data_path points to it, but MNIST is split into 4 separate files.

Contributor

@hzavadil98 hzavadil98 left a comment


The test looks good!

@c-salomonsen
Contributor Author

The tests fail on gh-actions since it's not capable of downloading the datasets. I'll make some changes to instead assume the data is present somehow, or just skip checking the lengths.
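One way to do that, sketched here with the stdlib's `unittest` (if the repo uses pytest, `pytest.mark.skipif` plays the same role); the data directory name is hypothetical:

```python
import pathlib
import unittest

# Hypothetical data location; absent on CI, so the test is skipped there.
DATA_DIR = pathlib.Path("Data/MNIST")

class TestLoadData(unittest.TestCase):
    @unittest.skipUnless(DATA_DIR.exists(), "dataset not downloaded (e.g. on CI)")
    def test_dataset_is_nonempty(self):
        # Only runs when the data files actually exist locally.
        self.assertTrue(any(DATA_DIR.iterdir()))
```

Skipping keeps the test green on CI while still exercising the length checks on machines that have the data.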

@c-salomonsen
Contributor Author

> Hi, in #53 I threw a hammer into the datasets a little bit - the parameters passed through load_data are a bit different, but it shouldn't affect the test for it, I believe. Regarding the data_path argument: I store the MNIST files in the ./Data/MNIST folder and data_path points to it, but MNIST is split into 4 separate files.

So perhaps `load_data` should just take a directory path, and then a fixed filename is stored in each class, i.e.:

```python
class MyDataset(Dataset):
    filename = "mydataset.data"

    def __init__(self, data_dir: pathlib.Path | str, ...):
        # Coerce to Path so a plain str for data_dir also works with "/".
        self.data_path = pathlib.Path(data_dir) / self.filename
```

@c-salomonsen
Contributor Author

Btw, while this PR implements a test for load_data, I made a test for the MetricWrapper/load_metrics in #54, here.

So what remains is a test for load_model. Sorry for the confusing structure with the tests split across two PRs...

@c-salomonsen c-salomonsen marked this pull request as ready for review February 13, 2025 12:04
@Seilmast Seilmast merged commit 018b669 into main Feb 13, 2025
3 of 4 checks passed
@hzavadil98 hzavadil98 deleted the christian/test-model-metric-data branch February 13, 2025 12:14


Linked issue (may be closed by this PR): Discussion: Train/val/test splitting and dataset generation
