Nemo curator laion #31
base: main
Conversation
…ples into nemo-curator-dedup
Code context under review (diff formatting stripped):

    return asyncio.run(process_batch(batch, output_dir, batch_num))
    ...
    def download_webdataset(
This assumes the whole dataset fits on disk on one machine, right? (Fine for LAION, since it is just URLs, but probably not in general.)
What's the best way to get data into NeMo Curator? E.g., would it make sense to use Ray Data to read the data and stream it in? Or does NeMo Curator have methods for this?
Since NeMo Curator uses NVIDIA DALI, I think the ideal data-loading story would be to keep all the images in something like S3, partitioned into tar shards. We can then mount the S3 bucket on each node, with each node accessing only the subset of tar shards that it is computing on. Would you like me to build this into the example?
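A minimal sketch of the shard-partitioning idea described above: tar shards live under a common prefix, and each node computes on only the shards assigned to it. The bucket name, shard naming scheme, and round-robin assignment policy are illustrative assumptions, not NeMo Curator or DALI API.

```python
def shards_for_node(shard_urls, node_rank, num_nodes):
    """Round-robin assignment of tar shards to one node.

    Every shard is claimed by exactly one node, so the nodes
    together cover the full dataset without overlap.
    """
    return [url for i, url in enumerate(shard_urls) if i % num_nodes == node_rank]


if __name__ == "__main__":
    # Hypothetical shard layout: shard-00000.tar .. shard-00007.tar on S3.
    shards = [f"s3://my-bucket/laion/shard-{i:05d}.tar" for i in range(8)]
    for rank in range(3):
        print(rank, shards_for_node(shards, rank, num_nodes=3))
```

Each node would then hand its shard subset to its local loader, so no single machine ever needs the full dataset on disk.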
No description provided.