@tsmathis

should just work™️

@tsmathis

tsmathis commented Nov 12, 2025

re: iterating, indexing into the local dataset, etc.

I'm a little conflicted about what the best route for the Python-like implementation/behavior of the local MPDatasets should be, mainly because as soon as we leave arrow-land we're neutering the performance that can be achieved.

As an example, regardless of how we do the iteration behavior, this is dog water:

# doesn't work currently; would have to update iteration behavior to match Aaron's review comment first
>>> tasks = mpr.materials.tasks.search()
>>> non_metallic_r2scan_structures = [
...     x.structure
...     for x in tasks
...     if x.output.bandgap > 0 and x.run_type == "r2SCAN"
... ]

compared to:

>>> import pyarrow.compute as pc
>>> tasks_ds = tasks.pyarrow_dataset
>>> expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
>>> non_metallic_r2scan_structures = tasks_ds.to_table(columns=["structure"], filter=expr)

which runs in under a second on my machine.

I am obviously biased on this front since I'm comfortable with arrow's usage patterns; I'm not sure the average client user would be willing to go down that route. Ideally, though, we should be guiding users towards a "golden path", and one possible middle ground is sketched below.
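
A minimal sketch of that kind of middle ground (assuming the same `pyarrow_dataset` attribute as above; `Dataset.to_batches` is standard pyarrow API): keep the filter pushdown in arrow, but yield batches users can still loop over in plain Python:

>>> import pyarrow.compute as pc
>>> ds = tasks.pyarrow_dataset
>>> expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
>>> structures = []
>>> for batch in ds.to_batches(columns=["structure"], filter=expr):
...     # each batch is a pyarrow.RecordBatch; only matching rows are materialized
...     structures.extend(batch.column("structure").to_pylist())

The scan stays streaming (bounded memory), and the final objects are ordinary Python values.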

@esoteric-ephemera
Copy link
Collaborator

Yeah, it's hard to say what's best in this case. We'd probably want to prioritize user experience across endpoints, or just throw a specialized warning on full task retrieval that the return type is different.

If pandas is a no-op from parquet (not sure if that's also true for the dataset or just an individual table/array), then that could be a viable alternative? Feels like pandas will be more familiar than arrow datasets.
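
Presumably that would look something like this (a rough sketch, reusing the `pyarrow_dataset` attribute from above; column names are illustrative, and the conversion to pandas is only zero-copy for certain dtypes, e.g. numeric columns without nulls):

>>> import pyarrow.compute as pc
>>> ds = tasks.pyarrow_dataset
>>> df = ds.to_table(
...     columns=["task_id", "run_type"],
...     filter=pc.field(("output", "bandgap")) > 0,
... ).to_pandas()  # pruning/filtering happen in arrow; the pandas conversion is cheap

The heavy lifting still happens in arrow-land, but the user ends up with a familiar DataFrame.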

@tsmathis

@esoteric-ephemera, @tschaume - this is ready for review again. Cleaned up and up to date, etc.

Some refreshers:
perf (non-rigorous):

# w/ deltalake & pyarrow
>>> timeit.timeit(lambda: mpr.materials.tasks.search(), number=1)
Retrieving CoreTaskDoc documents: 100% | <progress_bar> | 1914019/1914019 [07:31<00:00, 4240.33it/s]

454.2317273330045 (seconds)

# w/ mp-api v0.46.0
>>> timeit.timeit(lambda: mpr.materials.tasks.search(), number=1)
Retrieving CoreTaskDoc documents:  36%| <progress_bar> | 513085/1435073 [09:16<20:53, 735.30it/s]
zsh: killed     python
/Users/tsmathis/miniconda3/envs/test_api_pypi/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
  
# dies @ ~35%, :shrug:

local caching (w/ warnings):

>>> tasks = mpr.materials.tasks.search()
mp_api.client.core.client - WARNING - Dataset for tasks already exists at /Users/tsmathis/mp_datasets/parsed/core/tasks, returning existing dataset.
mp_api.client.core.client - INFO - Delete or move existing dataset or re-run search query with MPRester(force_renew=True) to refresh local dataset.

Access-controlled dataset compatibility (+ accurate pbar):

>>> import os
>>> commercial_user_key = os.environ.get("BY_C_KEY")
>>> non_commercial_user_key = os.environ.get("BY_NC_KEY")

>>> with MPRester(commercial_user_key) as mpr:
...    by_c_tasks = mpr.materials.tasks.search()
Retrieving CoreTaskDoc documents: 100%| <progress_bar> | 1914019/1914019 [07:35<00:00, 4200.25it/s]
>>> len(by_c_tasks)
1914019
>>> with MPRester(non_commercial_user_key, force_renew=True) as mpr:  # clear local cache
...    by_nc_tasks = mpr.materials.tasks.search()
# well, the pbar for BY_NC actually shows the count of v7 tasks atm...
# w/ the update to v8 tasks the pbar will be accurate for non-commercial users
>>> len(by_nc_tasks)
1797065

Warnings on sub-optimal usage w/ links to docs:

>>> _ = tasks[0]
<stdin>:1: MPDatasetIndexingWarning:
            Pythonic indexing into arrow-based MPDatasets is sub-optimal, consider using
            idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
            docs.materialsproject.org/downloading-data/arrow-datasets

>>> _ = tasks[0:10]
<stdin>:1: MPDatasetSlicingWarning:
                Pythonic slicing of arrow-based MPDatasets is sub-optimal, consider using
                idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
                docs.materialsproject.org/downloading-data/arrow-datasets

>>> for i in tasks:
...     _ = i
...
<stdin>:1: MPDatasetIterationWarning:
                Iterating through arrow-based MPDatasets is sub-optimal, consider using
                idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
                docs.materialsproject.org/downloading-data/arrow-datasets
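
For reference, the idiomatic arrow equivalents of those three patterns would look roughly like this (a sketch only; `pyarrow_dataset` as shown earlier, with the real examples to live on the docs page below):

>>> ds = tasks.pyarrow_dataset
>>> first = ds.head(1)             # instead of tasks[0]; returns a pyarrow.Table
>>> first_ten = ds.head(10)        # instead of tasks[0:10]
>>> for batch in ds.to_batches():  # instead of row-by-row iteration
...     ...                        # work on one pyarrow.RecordBatch at a time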

@tsmathis

Still need to write docs.materialsproject.org/downloading-data/arrow-datasets though... next on the todo list.
