
Conversation

@avigyabb

No description provided.

return asyncio.run(process_batch(batch, output_dir, batch_num))


def download_webdataset(
Contributor


This assumes the whole dataset fits on disk on one machine, right? (Fine for LAION, since it is just URLs, but probably not in general.)

What's the best way to get data into NeMo Curator? E.g., would it make sense to use Ray Data to read the data and stream it in? Or does NeMo Curator have methods for this?
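
For concreteness, a minimal sketch of the streaming approach this question gestures at, using Ray Data's WebDataset reader; the bucket path, shard count, and batch size are illustrative assumptions, not part of this PR:

```python
import ray

# A hypothetical S3 layout of WebDataset tar shards; the bucket/prefix
# and shard count are illustrative assumptions.
shard_paths = [f"s3://my-bucket/laion-shards/{i:05d}.tar" for i in range(1000)]

# Ray Data reads the shards lazily and streams batches to workers,
# so the full dataset never has to sit on one machine's disk.
ds = ray.data.read_webdataset(shard_paths)
for batch in ds.iter_batches(batch_size=256):
    ...  # hand each batch to the processing step
```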

Author


Since NeMo Curator uses NVIDIA DALI, I think the ideal data-loading story would be to have all the images in something like S3, partitioned into tar shards. We could then mount the S3 bucket on each node, with each node accessing only the subset of tar shards it is computing on. Would you like me to build this into the example?
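
A minimal sketch of that layout, assuming the bucket is mounted at a local path on every node; the mount point, five-digit shard naming, and node-rank scheme are all illustrative assumptions:

```python
import os

# Hypothetical local mount point for the S3 bucket (e.g., via a FUSE mount);
# the path and shard naming are illustrative assumptions, not part of this PR.
MOUNT_ROOT = "/mnt/laion"
NUM_SHARDS = 1000

def shards_for_node(node_rank: int, num_nodes: int) -> list[str]:
    """Return the disjoint subset of tar shards assigned to this node.

    Round-robin assignment keeps the split even, so each node's DALI
    pipeline only touches its own slice of the mounted bucket.
    """
    all_shards = [os.path.join(MOUNT_ROOT, f"{i:05d}.tar") for i in range(NUM_SHARDS)]
    return all_shards[node_rank::num_nodes]

# Example: node 2 of 8 processes 00002.tar, 00010.tar, 00018.tar, ...
print(shards_for_node(2, 8)[:3])
```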

