
Conversation

@avigyabb

No description provided.

return asyncio.run(process_batch(batch, output_dir, batch_num))


def download_webdataset(
Contributor


This assumes the whole dataset fits on disk on one machine, right? (Fine for LAION, since it is just URLs, but probably not in general.)

What's the best way to get data into NeMo Curator? E.g., would it make sense to use Ray Data to read the data and stream it in? Or does NeMo Curator have methods for this?
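
For concreteness, a minimal sketch of the streaming approach this question gestures at, using Ray Data's WebDataset reader; the bucket path, shard count, and batch size are illustrative assumptions, not part of this PR:

```python
import ray

# A hypothetical S3 layout of WebDataset tar shards; the bucket/prefix
# and shard count are illustrative assumptions.
shard_paths = [f"s3://my-bucket/laion-shards/{i:05d}.tar" for i in range(1000)]

# Ray Data reads the shards lazily and streams batches to workers,
# so the full dataset never has to sit on one machine's disk.
ds = ray.data.read_webdataset(shard_paths)
for batch in ds.iter_batches(batch_size=256):
    ...  # hand each batch to the processing step
```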

Author


Since NeMo Curator uses NVIDIA DALI, I think the ideal data-loading story would be to have all the images in something like S3, partitioned into tar shards. We could then mount the S3 bucket on each node, with each node accessing only the subset of tar shards it is computing on. Would you like me to build this into the example?
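
A minimal sketch of that layout, assuming the bucket is mounted at a local path on every node; the mount point, five-digit shard naming, and node-rank scheme are all illustrative assumptions:

```python
import os

# Hypothetical local mount point for the S3 bucket (e.g., via a FUSE mount);
# the path and shard naming are illustrative assumptions, not part of this PR.
MOUNT_ROOT = "/mnt/laion"
NUM_SHARDS = 1000

def shards_for_node(node_rank: int, num_nodes: int) -> list[str]:
    """Return the disjoint subset of tar shards assigned to this node.

    Round-robin assignment keeps the split even, so each node's DALI
    pipeline only touches its own slice of the mounted bucket.
    """
    all_shards = [os.path.join(MOUNT_ROOT, f"{i:05d}.tar") for i in range(NUM_SHARDS)]
    return all_shards[node_rank::num_nodes]

# Example: node 2 of 8 processes 00002.tar, 00010.tar, 00018.tar, ...
print(shards_for_node(2, 8)[:3])
```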

