Add Deltalake query support #1023
Conversation
re: the iterating, indexing into the local dataset, etc. I'm a little conflicted on what the best route is for the python-like implementation/behavior of the local dataset. As an example:

```python
# doesn't work currently, would have to update iterating to match Aaron's review comment first
>>> tasks = mpr.materials.tasks.search()
>>> non_metallic_r2scan_structures = [
...     x.structure
...     for x in tasks
...     if x.output.bandgap > 0 and x.run_type == "r2SCAN"
... ]
```

compared to:

```python
>>> import pyarrow.compute as pc
>>> tasks_ds = tasks.pyarrow_dataset
>>> expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
>>> non_metallic_r2scan_structures = tasks_ds.to_table(columns=["structure"], filter=expr)
```

which is sub-second execution on my machine. I am obviously biased on this front since I am comfortable with arrow's usage patterns, and I'm not sure the average client user would be willing to go down that route. Ideally, though, we should be guiding users towards a "golden path".
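One possible middle ground for that "golden path" (purely a sketch, not part of this PR: `MPDatasetView`, its `where` signature, and the double-underscore nested-field spelling are all hypothetical) would be a thin wrapper that keeps arrow's predicate pushdown but hides the `pc.field` expression syntax behind keyword arguments:

```python
import operator

import pyarrow.compute as pc
import pyarrow.dataset as ds


class MPDatasetView:
    """Hypothetical "golden path" helper, NOT part of this PR.

    Keeps arrow's predicate pushdown but hides pc.field expression syntax
    behind keyword arguments. Nested fields are spelled with double
    underscores: output__bandgap=("gt", 0) means ("output", "bandgap") > 0.
    """

    _OPS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}

    def __init__(self, dataset: ds.Dataset) -> None:
        self._ds = dataset

    def where(self, **conditions):
        # AND together one clause per keyword condition
        expr = None
        for name, (op, value) in conditions.items():
            clause = self._OPS[op](pc.field(*name.split("__")), value)
            expr = clause if expr is None else expr & clause
        return self._ds.to_table(filter=expr)


# e.g. (hypothetical):
# MPDatasetView(tasks.pyarrow_dataset).where(
#     output__bandgap=("gt", 0), run_type=("eq", "r2SCAN"))
```

Something like this would keep arrow doing the heavy lifting while reading closer to the kwargs-style filtering client users already know.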
Yeah, it's hard to say what's best in this case. We'd probably want to prioritize user experience across endpoints, or just throw a specialized warning for full task retrieval that the return type is different.
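A minimal sketch of what that specialized warning could look like (the class name, the helper function, and the full-retrieval check are all assumptions, not code from this PR):

```python
import warnings


# Hypothetical: one way to flag that a full task retrieval returns an
# arrow-backed dataset rather than the usual list of documents.
class MPDatasetReturnTypeWarning(UserWarning):
    pass


def maybe_warn_return_type(num_matched: int, total_in_collection: int) -> None:
    # Assumption: "full retrieval" means the query matched every document.
    if num_matched == total_in_collection:
        warnings.warn(
            "Retrieving the full tasks collection returns an arrow-backed "
            "MPDataset instead of a list of documents; see "
            "docs.materialsproject.org/downloading-data/arrow-datasets.",
            MPDatasetReturnTypeWarning,
            stacklevel=2,
        )
```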
@esoteric-ephemera, @tschaume - this is ready for review again. Cleaned up, up to date, etc. Some refreshers:

```python
# w/ deltalake & pyarrow
>>> timeit.timeit(lambda: mpr.materials.tasks.search(), number=1)
Retrieving CoreTaskDoc documents: 100%| <progress_bar> | 1914019/1914019 [07:31<00:00, 4240.33it/s]
454.2317273330045 (seconds)

# w/ mp-api v0.46.0
>>> timeit.timeit(lambda: mpr.materials.tasks.search(), number=1)
Retrieving CoreTaskDoc documents: 36%| <progress_bar> | 513085/1435073 [09:16<20:53, 735.30it/s]
zsh: killed python
/Users/tsmathis/miniconda3/envs/test_api_pypi/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
# dies @ ~35%, :shrug:
```

Local caching (w/ warnings):

```python
>>> tasks = mpr.materials.tasks.search()
mp_api.client.core.client - WARNING - Dataset for tasks already exists at /Users/tsmathis/mp_datasets/parsed/core/tasks, returning existing dataset.
mp_api.client.core.client - INFO - Delete or move existing dataset or re-run search query with MPRester(force_renew=True) to refresh local dataset.
```

Access-controlled dataset compatibility (+ accurate …):

```python
commercial_user_key = os.environ.get("BY_C_KEY")
non_commercial_user_key = os.environ.get("BY_NC_KEY")

>>> with MPRester(commercial_user_key) as mpr:
...     by_c_tasks = mpr.materials.tasks.search()
Retrieving CoreTaskDoc documents: 100%| <progress_bar> | 1914019/1914019 [07:35<00:00, 4200.25it/s]
>>> len(by_c_tasks)
1914019

>>> with MPRester(non_commercial_user_key, force_renew=True) as mpr:  # clear local cache
...     by_nc_tasks = mpr.materials.tasks.search()
# well, the pbar for BY_NC actually shows the count of v7 tasks atm...,
# w/ the update to v8 tasks the pbar will be accurate for non-commercial users
>>> len(by_nc_tasks)
1797065
```

Warnings on sub-optimal usage w/ links to docs:

```python
>>> _ = tasks[0]
<stdin>:1: MPDatasetIndexingWarning:
Pythonic indexing into arrow-based MPDatasets is sub-optimal, consider using
idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
docs.materialsproject.org/downloading-data/arrow-datasets

>>> _ = tasks[0:10]
<stdin>:1: MPDatasetSlicingWarning:
Pythonic slicing of arrow-based MPDatasets is sub-optimal, consider using
idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
docs.materialsproject.org/downloading-data/arrow-datasets

>>> for i in tasks:
...     _ = i
...
<stdin>:1: MPDatasetIterationWarning:
Iterating through arrow-based MPDatasets is sub-optimal, consider using
idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
docs.materialsproject.org/downloading-data/arrow-datasets
```
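For reference, a minimal sketch of how these warnings could be wired into the dataset wrapper. The warning class names and messages are taken from the output above; the `MPDataset` internals shown here (a `pyarrow.Table` backing attribute, `to_pylist`-based access) are assumptions, not the PR's actual implementation:

```python
import warnings

import pyarrow as pa

_DOCS_URL = "docs.materialsproject.org/downloading-data/arrow-datasets"


# Names match the warnings emitted above; everything else is illustrative.
class MPDatasetIndexingWarning(UserWarning): ...
class MPDatasetSlicingWarning(UserWarning): ...
class MPDatasetIterationWarning(UserWarning): ...


def _warn(action: str, category: type[Warning]) -> None:
    warnings.warn(
        f"{action} arrow-based MPDatasets is sub-optimal, consider using "
        f"idiomatic arrow patterns. See MP's docs on MPDatasets for relevant "
        f"examples: {_DOCS_URL}",
        category,
        stacklevel=3,  # point at the user's indexing/iteration site
    )


class MPDataset:
    def __init__(self, table: pa.Table) -> None:
        self._table = table  # assumed backing store

    def __getitem__(self, key):
        if isinstance(key, slice):
            _warn("Pythonic slicing of", MPDatasetSlicingWarning)
        else:
            _warn("Pythonic indexing into", MPDatasetIndexingWarning)
        # deliberately naive row access: materializes python objects,
        # which is exactly why the warning fires
        return self._table.to_pylist()[key]

    def __iter__(self):
        _warn("Iterating through", MPDatasetIterationWarning)
        return iter(self._table.to_pylist())
```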
still need to write docs.materialsproject.org/downloading-data/arrow-datasets though...
should just work™️