Add in_memory option to PersistentDataset for hybrid caching #8691
base: dev
Conversation
- Add `in_memory` parameter to PersistentDataset that combines persistent disk storage with RAM caching for faster subsequent access
- Add `memory_cache_size` property for inspecting cache state
- Compute cache key once in `_cachecheck()` to avoid redundant hash computation
- Clear memory cache when `set_data()` is called
- Works with or without `cache_dir` (pure RAM cache mode)

Addresses: Project-MONAI#6753
Signed-off-by: Soumya Snigdha Kundu <soumya_snigdha.kundu@kcl.ac.uk>
📝 Walkthrough

The pull request introduces an in-memory caching layer to PersistentDataset.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @tests/data/test_persistentdataset.py:
- Around lines 294-296: The test expects untransformed data, but the transform class _InplaceXform mutates data by adding np.pi to data[0]. Update the assertion in the results loop to account for the transform: compute the expected row as list(range(i)) and add np.pi to its first element (or build the expected array with numpy and add np.pi to expected[0, 0]), then compare with results using a numeric comparison (e.g., np.testing.assert_allclose) instead of plain equality so that cached, transformed values like [[np.pi, 1]] match the expectation.
🧹 Nitpick comments (3)
tests/data/test_persistentdataset.py (1)

231-234: Consider testing object identity for memory cache hits. With in_memory caching, accessing the same index twice should return the same object reference; using `assertIs(result1, result2)` would verify the cache is truly being used. A sketch follows.
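A minimal sketch of such an identity check, assuming the memory cache returns the stored object rather than a copy (the class name and data here are illustrative, not taken from the PR):

```python
import tempfile
import unittest

import numpy as np

from monai.data import PersistentDataset


class TestMemoryCacheIdentity(unittest.TestCase):
    def test_same_object_on_repeat_access(self):
        with tempfile.TemporaryDirectory() as tempdir:
            ds = PersistentDataset(
                data=[np.array([0.0, 1.0])],
                transform=None,  # identity pipeline keeps the check simple
                cache_dir=tempdir,
                in_memory=True,  # parameter added by this PR
            )
            first = ds[0]
            second = ds[0]
            # A true memory-cache hit should hand back the stored object
            # itself, so identity (not merely equality) should hold.
            self.assertIs(first, second)
```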
monai/data/dataset.py (2)
417-422: Static analysis hint: return in try block. Ruff TRY300 suggests moving the `return` to an `else` block. The current code is correct, but restructuring would improve clarity by separating the success path from the exception handling.

Optional refactor:

```diff
 if hashfile is not None and hashfile.is_file():  # cache hit
     try:
         _item_transformed = torch.load(hashfile, weights_only=self.weights_only)
-        if self.in_memory:
-            self._memory_cache[cache_key] = _item_transformed
-        return _item_transformed
     except PermissionError as e:
         if sys.platform != "win32":
             raise e
     except (UnpicklingError, RuntimeError) as e:
         # corrupt or unloadable cached files are recomputed
         if "Invalid magic number; corrupt file" in str(e) or isinstance(e, UnpicklingError):
             warnings.warn(f"Corrupt cache file detected: {hashfile}. Deleting and recomputing.")
             hashfile.unlink()
         else:
             raise e
+    else:
+        if self.in_memory:
+            self._memory_cache[cache_key] = _item_transformed
+        return _item_transformed
```
444-461: Potential type inconsistency between disk and memory cache. Line 445 converts to tensor with `convert_to_tensor` before saving to disk, but line 460 stores the original `_item_transformed` in memory. On reload, the disk cache returns tensor types, while the memory cache may hold different types. If consistency matters, store the converted version.

Proposed fix:

```diff
+_converted = convert_to_tensor(_item_transformed, convert_numeric=False, track_meta=self.track_meta)
 try:
     # NOTE: Writing to a temporary directory and then using a nearly atomic rename operation
     # to make the cache more robust to manual killing of parent process
     # which may leave partially written cache files in an incomplete state
     with tempfile.TemporaryDirectory() as tmpdirname:
         temp_hash_file = Path(tmpdirname) / hashfile.name
         torch.save(
-            obj=convert_to_tensor(_item_transformed, convert_numeric=False, track_meta=self.track_meta),
+            obj=_converted,
             f=temp_hash_file,
             pickle_module=look_up_option(self.pickle_module, SUPPORTED_PICKLE_MOD),
             pickle_protocol=self.pickle_protocol,
         )
         if temp_hash_file.is_file() and not hashfile.is_file():
             # On Unix, if target exists and is a file, it will be replaced silently if the user has permission.
             # for more details: https://docs.python.org/3/library/shutil.html#shutil.move.
             try:
                 shutil.move(str(temp_hash_file), hashfile)
             except FileExistsError:
                 pass
 except PermissionError:  # project-monai/monai issue #3613
     pass
 if self.in_memory:
-    self._memory_cache[cache_key] = _item_transformed
+    self._memory_cache[cache_key] = _converted
 return _item_transformed
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- monai/data/dataset.py
- tests/data/test_persistentdataset.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
Review the Python code for quality and correctness. Ensure variable names adhere to the PEP8 style guide and are sensible and informative with regard to their function, though simple names are permitted for loop and comprehension variables. Ensure routine names are meaningful with regard to their function and use verbs, adjectives, and nouns in a semantically appropriate way. Docstrings should be present for all definitions and should describe each variable, return value, and raised exception in the appropriate section of Google-style docstrings. Examine code for logical errors or inconsistencies, and suggest what may be changed to address these. Suggest enhancements that improve the code's efficiency, maintainability, comprehensibility, and correctness. Ensure new or modified definitions will be covered by existing or new unit tests.
Files:
- tests/data/test_persistentdataset.py
- monai/data/dataset.py
🧬 Code graph analysis (1)
tests/data/test_persistentdataset.py (2)
monai/data/dataset.py (5)
- memory_cache_size (311-313)
- set_data (335-344)
- set_data (642-648)
- set_data (875-910)
- set_data (1114-1128)

tests/data/test_gdsdataset.py (1)
- _InplaceXform (65-68)
🪛 Ruff (0.14.10)
monai/data/dataset.py
422-422: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: packaging
- GitHub Check: flake8-py3 (codeformat)
- GitHub Check: flake8-py3 (mypy)
- GitHub Check: quick-py3 (windows-latest)
- GitHub Check: build-docs
- GitHub Check: quick-py3 (ubuntu-latest)
- GitHub Check: quick-py3 (macOS-latest)
- GitHub Check: flake8-py3 (pytype)
- GitHub Check: min-dep-pytorch (2.8.0)
- GitHub Check: min-dep-py3 (3.12)
- GitHub Check: min-dep-pytorch (2.5.1)
- GitHub Check: min-dep-pytorch (2.7.1)
- GitHub Check: min-dep-py3 (3.9)
- GitHub Check: min-dep-py3 (3.10)
- GitHub Check: min-dep-os (macOS-latest)
- GitHub Check: min-dep-py3 (3.11)
- GitHub Check: min-dep-os (windows-latest)
- GitHub Check: min-dep-pytorch (2.6.0)
- GitHub Check: min-dep-os (ubuntu-latest)
🔇 Additional comments (7)
tests/data/test_persistentdataset.py (2)
240-254: LGTM. The pure RAM caching test correctly validates `memory_cache_size` behavior with `cache_dir=None`.
256-279: LGTM. Good test coverage for the hybrid caching restart scenario; the use of `Path.glob("*.pt")` properly verifies disk cache population. A simplified sketch of the pattern follows.
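A simplified sketch of that restart pattern (not the PR's verbatim test; `in_memory` and `memory_cache_size` are the additions under review, and the identity transform is a placeholder):

```python
import tempfile
from pathlib import Path

import numpy as np

from monai.data import PersistentDataset

data = [np.array([float(i)]) for i in range(3)]

with tempfile.TemporaryDirectory() as tempdir:
    # First "session": accessing every item populates disk and RAM caches.
    ds = PersistentDataset(data=data, transform=None, cache_dir=tempdir, in_memory=True)
    _ = [ds[i] for i in range(len(ds))]
    assert len(list(Path(tempdir).glob("*.pt"))) == len(data)  # disk cache written

    # Simulated restart: a fresh instance starts with an empty RAM cache but
    # can still reload items from the persistent .pt files on first access.
    ds2 = PersistentDataset(data=data, transform=None, cache_dir=tempdir, in_memory=True)
    assert ds2.memory_cache_size == 0
    _ = ds2[0]
    assert ds2.memory_cache_size == 1
```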
monai/data/dataset.py (5)

235-235: LGTM. New `in_memory` parameter with a sensible default of `False`. The docstring is clear and initialization is correct. Also applies to: 277-280, 307-308.
310-313: LGTM. Simple property with an appropriate docstring; a likely shape is sketched below.
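For reference, a property of that shape can be as small as the following (a sketch of the likely implementation, not the verbatim PR code; the holder class is a stand-in for the dataset):

```python
class _CacheHolder:
    """Minimal stand-in for the dataset's in-memory cache bookkeeping."""

    def __init__(self) -> None:
        self._memory_cache: dict = {}

    @property
    def memory_cache_size(self) -> int:
        """Return the number of items currently held in the in-memory cache."""
        return len(self._memory_cache)
```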
341-341: LGTM. The memory cache is correctly cleared on data update.
433-437: LGTM. Pure RAM mode correctly stores the transformed item when `cache_dir=None`.
413-422: Thread safety consideration for multi-worker DataLoader. `_memory_cache` is a plain dict accessed without synchronization. With `num_workers > 0` in DataLoader, each worker is a separate process with its own memory, so this works. However, with thread-based data loading or shared-memory patterns, concurrent writes could race. Current usage is safe for typical DataLoader patterns; a lock-guarded variant is sketched below for completeness.
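If thread-based loading ever needs to be supported, one option would be a small lock-guarded wrapper around the cache (a hypothetical sketch, not part of this PR):

```python
import threading


class _LockedMemoryCache:
    """Illustrative thread-safe wrapper around a dict-based item cache."""

    def __init__(self) -> None:
        self._store: dict = {}
        self._lock = threading.Lock()

    def put(self, key, value) -> None:
        # Single dict assignments are atomic under CPython's GIL, but an
        # explicit lock keeps any check-then-set sequence race-free.
        with self._lock:
            self._store[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._store.get(key, default)

    def __len__(self) -> int:
        with self._lock:
            return len(self._store)
```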
The lines in question:

```python
# Verify: Results are correct
for i, result in enumerate(results):
    self.assertEqual(result, [list(range(i))])
```
Assertion may fail due to transform behavior.
The _InplaceXform modifies data by adding np.pi to data[0] (line 60). The assertion self.assertEqual(result, [list(range(i))]) expects untransformed data, but cached results will have np.pi added.
For i=2, cached result would be [[np.pi, 1]], not [[0, 1]].
Proposed fix:

```diff
-        for i, result in enumerate(results):
-            self.assertEqual(result, [list(range(i))])
+        for i, result in enumerate(results):
+            expected = [list(range(i))]
+            if expected[0]:
+                expected[0][0] = expected[0][0] + np.pi
+            self.assertEqual(result, expected)
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test and benchmark the code to ensure it meets the requirements.

```python
# Verify: Results are correct
for i, result in enumerate(results):
    expected = [list(range(i))]
    if expected[0]:
        expected[0][0] = expected[0][0] + np.pi
    self.assertEqual(result, expected)
```
Summary
Addresses #6753
This PR adds an `in_memory` option to `PersistentDataset` that combines the benefits of persistent disk storage with fast RAM access. This provides the best of both worlds: the disk cache survives process restarts, while the RAM cache makes repeat accesses within a session fast.
Changes

- Add `in_memory: bool = False` parameter to `PersistentDataset`
- Add `memory_cache_size` property for inspecting cache state
- Refactor `_cachecheck()` to compute the cache key once
- Clear the memory cache when `set_data()` is called

Usage
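The original usage example did not survive this page's rendering; the sketch below is reconstructed from the bullets above (the random-array data and identity transform are placeholders, not the PR's example):

```python
import tempfile

import numpy as np

from monai.data import PersistentDataset

data = [{"image": np.random.rand(8, 8).astype(np.float32)} for _ in range(4)]

# Hybrid mode: transform results are written to cache_dir for reuse across
# runs, and additionally kept in RAM for fast repeat access in this process.
ds = PersistentDataset(data=data, transform=None, cache_dir=tempfile.mkdtemp(), in_memory=True)
_ = ds[0]
print(ds.memory_cache_size)  # -> 1 after the first access

# Pure RAM mode: with cache_dir=None nothing is written to disk; cached
# items live only in this process's memory.
ram_ds = PersistentDataset(data=data, transform=None, cache_dir=None, in_memory=True)
```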
Test plan

- Unit tests for `in_memory` cache functionality
- Tests with `cache_dir=None` (pure RAM mode)
- Test that `set_data()` clears the memory cache (an illustrative sketch of the last two items follows)
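For instance, the pure RAM mode and cache-clearing items could be exercised roughly like this (illustrative only; the real tests live in tests/data/test_persistentdataset.py):

```python
import unittest

import numpy as np

from monai.data import PersistentDataset


class TestPureRamMode(unittest.TestCase):
    def test_cache_dir_none_and_set_data(self):
        ds = PersistentDataset(
            data=[np.array([float(i)]) for i in range(3)],
            transform=None,
            cache_dir=None,  # no disk cache: RAM is the only cache layer
            in_memory=True,
        )
        self.assertEqual(ds.memory_cache_size, 0)
        _ = [ds[i] for i in range(3)]
        self.assertEqual(ds.memory_cache_size, 3)
        # set_data must invalidate the RAM cache
        ds.set_data([np.array([9.0])])
        self.assertEqual(ds.memory_cache_size, 0)


if __name__ == "__main__":
    unittest.main()
```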