@tsmathis

should just work™️

@tsmathis

tsmathis commented Nov 12, 2025

re: iterating, indexing into the local dataset, etc.

I'm a little conflicted about what the best route for the Python-like implementation/behavior of the local MPDatasets should be, mainly because as soon as we leave arrow-land we're neutering the performance that can be achieved.

As an example, regardless of how we do the iteration behavior, this is dog water:

# doesn't work currently; would have to update iteration behavior to match Aaron's review comment first
>>> tasks = mpr.materials.tasks.search()
>>> non_metallic_r2scan_structures = [
...     x.structure
...     for x in tasks
...     if x.output.bandgap > 0 and x.run_type == "r2SCAN"
... ]

compared to:

>>> import pyarrow.compute as pc
>>> tasks_ds = tasks.pyarrow_dataset
>>> expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
>>> non_metallic_r2scan_structures = tasks_ds.to_table(columns=["structure"], filter=expr)

which runs in under a second on my machine.

I am obviously biased on this front since I'm comfortable with arrow's usage patterns; I'm not sure the average client user would be willing to go down that route. Ideally, though, we should be guiding users towards a "golden path", and one possible middle ground is sketched below.
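
A minimal sketch of that kind of middle ground (assuming the same `pyarrow_dataset` attribute as above; `Dataset.to_batches` is standard pyarrow API): keep the filter pushdown in arrow, but yield batches users can still loop over in plain Python:

>>> import pyarrow.compute as pc
>>> ds = tasks.pyarrow_dataset
>>> expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
>>> structures = []
>>> for batch in ds.to_batches(columns=["structure"], filter=expr):
...     # each batch is a pyarrow.RecordBatch; only matching rows are materialized
...     structures.extend(batch.column("structure").to_pylist())

The scan stays streaming (bounded memory), and the final objects are ordinary Python values.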

@esoteric-ephemera
Copy link
Collaborator

Yeah, it's hard to say what's best in this case. We'd probably want to prioritize user experience across endpoints, or just throw a specialized warning on full task retrieval that the return type is different.

If pandas is a no-op from parquet (not sure if that's also true for the dataset or just an individual table/array), then that could be a viable alternative? Feels like pandas will be more familiar than arrow datasets.
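
Presumably that would look something like this (a rough sketch, reusing the `pyarrow_dataset` attribute from above; column names are illustrative, and the conversion to pandas is only zero-copy for certain dtypes, e.g. numeric columns without nulls):

>>> import pyarrow.compute as pc
>>> ds = tasks.pyarrow_dataset
>>> df = ds.to_table(
...     columns=["task_id", "run_type"],
...     filter=pc.field(("output", "bandgap")) > 0,
... ).to_pandas()  # pruning/filtering happen in arrow; the pandas conversion is cheap

The heavy lifting still happens in arrow-land, but the user ends up with a familiar DataFrame.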

@tsmathis

@esoteric-ephemera, @tschaume - this is ready for review again. Cleaned up and up to date, etc.

Some refreshers:
perf (non-rigorous):

# w/ deltalake & pyarrow
>>> timeit.timeit(lambda: mpr.materials.tasks.search(), number=1)
Retrieving CoreTaskDoc documents: 100% | <progress_bar> | 1914019/1914019 [07:31<00:00, 4240.33it/s]

454.2317273330045 (seconds)

# w/ mp-api v0.46.0
>>> timeit.timeit(lambda: mpr.materials.tasks.search(), number=1)
Retrieving CoreTaskDoc documents:  36%| <progress_bar> | 513085/1435073 [09:16<20:53, 735.30it/s]
zsh: killed     python
/Users/tsmathis/miniconda3/envs/test_api_pypi/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
  
# dies @ ~35%, :shrug:

local caching (w/ warnings):

>>> tasks = mpr.materials.tasks.search()
mp_api.client.core.client - WARNING - Dataset for tasks already exists at /Users/tsmathis/mp_datasets/parsed/core/tasks, returning existing dataset.
mp_api.client.core.client - INFO - Delete or move existing dataset or re-run search query with MPRester(force_renew=True) to refresh local dataset.

Access-controlled dataset compatibility (+ accurate pbar):

>>> import os
>>> commercial_user_key = os.environ.get("BY_C_KEY")
>>> non_commercial_user_key = os.environ.get("BY_NC_KEY")

>>> with MPRester(commercial_user_key) as mpr:
...    by_c_tasks = mpr.materials.tasks.search()
Retrieving CoreTaskDoc documents: 100%| <progress_bar> | 1914019/1914019 [07:35<00:00, 4200.25it/s]
>>> len(by_c_tasks)
1914019
>>> with MPRester(non_commercial_user_key, force_renew=True) as mpr:  # clear local cache
...    by_nc_tasks = mpr.materials.tasks.search()
# well, the pbar for BY_NC actually shows the count of v7 tasks atm...
# w/ the update to v8 tasks the pbar will be accurate for non-commercial users
>>> len(by_nc_tasks)
1797065

Warnings on sub-optimal usage w/ links to docs:

>>> _ = tasks[0]
<stdin>:1: MPDatasetIndexingWarning:
            Pythonic indexing into arrow-based MPDatasets is sub-optimal, consider using
            idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
            docs.materialsproject.org/downloading-data/arrow-datasets

>>> _ = tasks[0:10]
<stdin>:1: MPDatasetSlicingWarning:
                Pythonic slicing of arrow-based MPDatasets is sub-optimal, consider using
                idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
                docs.materialsproject.org/downloading-data/arrow-datasets

>>> for i in tasks:
...     _ = i
...
<stdin>:1: MPDatasetIterationWarning:
                Iterating through arrow-based MPDatasets is sub-optimal, consider using
                idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
                docs.materialsproject.org/downloading-data/arrow-datasets
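
For reference, the idiomatic arrow equivalents of those three patterns would look roughly like this (a sketch only; `pyarrow_dataset` as shown earlier, with the real examples to live on the docs page below):

>>> ds = tasks.pyarrow_dataset
>>> first = ds.head(1)             # instead of tasks[0]; returns a pyarrow.Table
>>> first_ten = ds.head(10)        # instead of tasks[0:10]
>>> for batch in ds.to_batches():  # instead of row-by-row iteration
...     ...                        # work on one pyarrow.RecordBatch at a time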

@tsmathis

Still need to write docs.materialsproject.org/downloading-data/arrow-datasets though... next on the todo list.
