[SPARK-55304][SS][PYTHON] Introduce support of Admission Control and Trigger.AvailableNow in Python data source - streaming reader #54085
Conversation
| same offset as given start parameter, to indicate that there is no more data to read. This
| includes the case where the query is restarted and the source is asked to read from the
Sorry, why do we need to handle the query restart case here? Shouldn't Trigger.AvailableNow get the new end offset after a query restart?
So here is the scenario: the query reads from a Kafka topic. In the first run, the topic has 3 partitions. During the downtime of the query, users repartition the Kafka topic so it now has 5 partitions. If there was an uncommitted batch, the second run of the query will get the start offset from the uncommitted batch, which had only 3 partitions. In the meanwhile, prepareForTriggerAvailableNow() will identify that there are 5 partitions and store the offset for 5 partitions. The source is responsible for reading further from the 3 partitions, figuring out the new partitions, and eventually reaching the offset stored by prepareForTriggerAvailableNow().
The scenario is actually complicated and I might not be able to describe the case in an easy-to-understand way. If there is a proposal for better wording, I appreciate the suggestion!
thanks for the explanation, the example sounds good to me
python/pyspark/sql/datasource.py
| the very first micro-batch, and the offset continues from the last micro-batch for the
| following. The source can return the same offset as start offset if there is no data to
maybe "and for subsequent micro-batches, the start offset is the ending offset from the previous micro-batch." is better iirc?
I'm slightly not in favor of coupling the contract with the engine's behavior, since it could limit us. (I found myself doing this and I'm even OK with omitting it.) But the explanation in your comment is unlikely to change, so it makes sense to me. Thanks for the suggestion.
python/pyspark/sql/datasource.py
| the very first micro-batch, and the offset continues from the last micro-batch for the
| following. The source can return the same offset as start offset if there is no data to
"The source can return the same offset as start offset if there is no data to process"
I feel this is also a bit confusing - which "start" offset are you referring to here?
I meant the parameter.
"The source can return the start parameter as it is, if there is no data to process"
^ Would it be clearer?
| engine; e.g. if the readLimit is :class:`ReadAllAvailable`, the source must ignore the read
| limit configured through options.
nit: maybe it will be clearer if we can provide an example in which the engine provides a different read limit than the configured one
It's for two cases: 1) Trigger.Once (deprecated), and 2) the fallback of Trigger.AvailableNow (when any stream in the query does not support Trigger.AvailableNow but Trigger.AvailableNow is requested). I'm not very sure we would like to document these cases, as once we document them they are considered a "contract" rather than an implementation detail.
|         self._registry[type_name] = read_limit_type
|
|     def get(self, type_name: str, params: dict) -> ReadLimit:
|         read_limit_type = self._registry[type_name]
I am not quite familiar with Python, but will this throw an exception if type_name doesn't exist?
+1 here, it will throw a KeyError I think. Either way, this should probably use self._registry.get(type_name) or have an appropriate check.
That's not expected to happen, but KeyError is definitely not preferable. I'll probably throw a better exception - if we have an internal exception in PySpark then I'll use that.
lol, I have a check against None on the next line. I'll just do .get().
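A minimal sketch of that lookup with an explicit error instead of a bare `KeyError`; the registry class name, error class, and message here are illustrative, not the PR's actual code:

```python
from pyspark.errors import PySparkValueError  # assuming an internal PySpark error is preferred


class ReadLimitRegistry:  # hypothetical container for the type-name -> class mapping
    def __init__(self):
        self._registry: dict = {}

    def get(self, type_name: str, params: dict) -> "ReadLimit":
        read_limit_type = self._registry.get(type_name)
        if read_limit_type is None:
            # Clearer failure than a bare KeyError if the type name is ever missing.
            raise PySparkValueError(f"Unknown ReadLimit type name: {type_name}")
        return read_limit_type(**params)
```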
is it possible to add a test case where latestOffset returns the same offset?
It happens with Trigger.AvailableNow; I'll check whether we have one for the processing time trigger. Maybe that needs some union or stream-stream join, since we still need to trigger the microbatch to verify the behavior. (If latestOffset returns the same offset and it's the only stream, we don't trigger the microbatch.)
allisonwang-db
left a comment
Thanks for working on this! Left some comments.
| Specifies limits on how much data to read from a streaming source when
| determining the latest offset.
Can we add more comments and examples on how to use this class?
The documentation should be covered in latestOffset() and getDefaultReadLimit() in DataSourceStreamReader.
Since devs have to use the built-in read limit implementations at this point, I'm going to enumerate the built-in implementations here so that they can understand which classes are available.
| Parameter
| ---------
| params : dict
|     The parameters to create the :class:`ReadLimit`. type name isn't included.
ditto can we add more examples?
Class methods in ReadLimit aren't user-facing; they are Spark-internal. Do we still need examples?
| assert isinstance(
|     limit, ReadAllAvailable
| ), "simple stream reader does not support read limit"
Does that mean we can't use availableNow with simple streaming reader?
No, admission control is not available for the simple stream reader. Trigger.AvailableNow is still available for the simple stream reader.
| ----------
| start : dict
|     The start offset of the microbatch to continue reading from.
| limit : :class:`ReadLimit`
|     The limit on the amount of data to be returned by this call.
Let's add supported version here
python/pyspark/sql/datasource.py
| NOTE: Previous Spark versions didn't have start offset and read limit parameters for this
| method. While Spark will ensure the backward compatibility for existing data sources, the
| new data sources are strongly encouraged to implement this new method signature.
Let's add this NOTE below as a docstring section?
You mean moving it out of the doc comment? I'm OK with it.
| from abc import ABC, abstractmethod
|
|
| class ReadLimit(ABC):
I think this ABC is over-designed, especially considering that we do not even support a custom ReadLimit class at this point.
type_name does not really do anything besides returning an identifier for this class. It's only used internally; I think you can directly use self.__class__.__name__.
cls.load(param) -> cls is not a super common pattern in Python or PySpark. This is just __init__, I think. You just need the subclass to have an __init__ function that takes a parameter; save that parameter and use it in dump, and that would be fine.
Bottom line is, you just need a serializable enum that takes some arguments. I think you can totally do it with just a dataclass.
```python
from dataclasses import dataclass

class ReadLimit:
    ...

@dataclass
class ReadAllAvailable(ReadLimit):
    type: str = "ReadAllAvailable"

@dataclass
class ReadMinRows(ReadLimit):
    min_rows: int
    type: str = "ReadMinRows"
```

You can do dataclasses.asdict(obj) to dump it and registry[type](**params) to create the class. What are the concerns with using this simple pattern?
It might be the case, but when we decide to extend this, would it be a one-way door where we would never be able to support a user-defined ReadLimit? If not, I'm happy to incorporate the feedback.
OK, it looks like this still retains the registry mechanism and just removes the classmethods. Maybe we could just ask the custom impl to provide the type and be done. OK for me.
Yeah, there could still be a custom RateLimit in the future. The difference is that the user only needs to define a dataclass like

```python
@dataclass
class MyOwnRateLimiter(RateLimit):
    name = "MyOwnRateLimiter"
    param1: int
    param2: str
```

And that's it - they don't need to do anything else.
I changed type to name here because I think that's probably a bit more intuitive. Also, type is a built-in name, so even though it's okay to use it here, I try to avoid it.
Also, if you think it's easier for them to do

```python
@dataclass
class MyOwnRateLimiter(RateLimit):
    param1: int
    param2: str
```

all you need to do is a bit of extra work when you dump it (not even load). When you dump it, do something like
asdict(r) | {"_type": r.__class__.__name__} - that saves another thing you need to specify in the dataclass.
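For example, a self-contained round-trip of that idea; the registry and dataclass here are placeholders, not the PR's code:

```python
from dataclasses import asdict, dataclass


@dataclass
class ReadMinRows:  # stand-in for the real ReadLimit subclass
    min_rows: int


registry = {"ReadMinRows": ReadMinRows}  # hypothetical name -> class registry

r = ReadMinRows(min_rows=100)
payload = asdict(r) | {"_type": r.__class__.__name__}  # dump: inject the class name
cls = registry[payload.pop("_type")]                   # load: pop it and look up the class
assert cls(**payload) == r
```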
I'll take the class name - basically I wanted flexibility in the name, but it sounds like I'd need to do that with a class variable, which is modifiable from outside and could mess things up. I realized Final is just a type hint.
| NON_EMPTY_PYARROW_RECORD_BATCHES = 1
| EMPTY_PYARROW_RECORD_BATCHES = 2
|
| SUPPORTS_ADMISSION_CONTROL = 1
SUPPORTS_ADMISSION_CONTROL = 1 << 0 is better I think.
|
| def check_support_func(reader: DataSourceStreamReader, outfile: IO) -> None:
|     support_flags = 0
|     if isinstance(reader, _SimpleStreamReaderWrapper):
I'm surprised that we wrote _SimpleStreamReaderWrapper, which is a subclass of DataSourceStreamReader, yet we still need a separate if statement for it - that, to me, is against the rule of inheritance...
However, in this case, do we really need to?
The same logic for inspect.signature applies to _SimpleStreamReaderWrapper because it does have the correct signature for reader.latestOffset. SupportsTriggerAvailableNow is not the important thing; the important thing should be prepareForTriggerAvailableNow. We should have read-through logic for _SimpleStreamReaderWrapper so that hasattr(reader, "prepareForTriggerAvailableNow") returns the underlying simple_reader's attributes directly.
Maybe we don't need to do this in this PR, but we should not claim that _SimpleStreamReaderWrapper is a DataSourceStreamReader while still needing a separate case to access simple_reader all the time.
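A rough sketch of that read-through idea, assuming the wrapper keeps the underlying reader in self.simple_reader as mentioned above; only the delegation method is shown:

```python
class _SimpleStreamReaderWrapper(DataSourceStreamReader):
    ...  # existing wrapper implementation

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so this falls through to the
        # wrapped SimpleDataSourceStreamReader; hasattr(wrapper, "prepareForTriggerAvailableNow")
        # then reflects whether the inner reader defines it.
        return getattr(self.simple_reader, name)
```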
While I get how your proposal works, I'd argue that this is just another workaround for doing something which we can't do through inheritance.
For example, you said SupportsTriggerAvailableNow isn't the important thing - in a non-Python language with strong typing, explicitly checking the type is mostly required, short of hacky approaches like reflection in Java, which is not a best practice. We do inspect latestOffset, but that is a last-resort approach, because Python does not support method overloading, so we can't do the same API evolution we did in Scala/Java.
(It's still possible to introduce a separate interface to achieve a similar thing, but it is not about overloading - it's just that the latest definition overrides the prior definition for the same method name.)
Though I understand that Python is duck-typed, the type system is very loose, and what matters is simply whether a function/method is available.
Maybe you were trying to give an approach that is better from a Pythonic point of view. It's just that both break inheritance, and the proposal seems to break it more than what I do. I agree if you want to say this isn't Pythonic, but then I could say this is one of the patterns we heavily leverage in Scala; we call it pattern matching. It's just that there is no language-level support here, so I have to do it manually.
we should not claim that _SimpleStreamReaderWrapper is a DataSourceStreamReader
This breaks the structure of the interface we designed. If _SimpleStreamReaderWrapper isn't a DataSourceStreamReader, what would it be?
gaogaotiantian
left a comment
I understand that there's a time pressure. I think as long as we can agree on the public interface, we can always polish the implementation later.
I have no knowledge of the Scala side, so I just have some comments on the Python code.
I think the proposal on the interface is closer to "arguable" than something I totally agree with. While it might be more Pythonic, at least we shouldn't claim that it is better for inheritance; from the proposal I don't see that we respect the interface at all. I don't think this can hold the PR from merging anyway, since it's more about "preference" at this point - that's not related to time pressure. Also, _SimpleStreamReaderWrapper is an internal implementation of DataSourceStreamReader, so modifying that class has nothing to do with the "public interface".
What changes were proposed in this pull request?
This PR proposes to introduce the support of Admission Control and Trigger.AvailableNow in Python data source - streaming reader.
To support Admission Control, we propose to change the `DataSourceStreamReader` interface as follows (side-by-side comparison of the current and proposed interface):

| Before | After |
| --- | --- |
| `class DataSourceStreamReader(ABC):` | `class DataSourceStreamReader(ABC):` |
| `def initialOffset(self) -> dict` | `def initialOffset(self) -> dict` |
| `def latestOffset() -> dict` | `def latestOffset(self, start: dict, limit: ReadLimit) -> dict` |
|  | `# NOTE: Optional to implement, default = ReadAllAvailable()`<br>`def getDefaultReadLimit(self) -> ReadLimit` |
|  | `# NOTE: Optional to implement, default = None`<br>`def reportLatestOffset(self) -> Optional[dict]` |
| `def partitions(self, start: dict, end: dict) -> Sequence[InputPartition]` | `def partitions(self, start: dict, end: dict) -> Sequence[InputPartition]` |
| `@abstractmethod def read(self, partition: InputPartition) -> Union[Iterator[Tuple], Iterator["RecordBatch"]]` | `@abstractmethod def read(self, partition: InputPartition) -> Union[Iterator[Tuple], Iterator["RecordBatch"]]` |
| `def commit(self, end: dict) -> None` | `def commit(self, end: dict) -> None` |
| `def stop(self) -> None` | `def stop(self) -> None` |

The main changes are the following:
- `latestOffset` is changed. The method is mandatory.
- `getDefaultReadLimit` is added, as optional.
- `reportLatestOffset` is added, as optional.

This way, new implementations support Admission Control by default. We ensure the engine can handle the case of the old method signature via Python's built-in inspect module (similar to Java's reflection): if `latestOffset` is implemented without parameters, we fall back to treating the source as one that does not enable admission control. For all new sources, implementing `latestOffset` with parameters is strongly recommended.
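For illustration, a minimal sketch of how such an inspect-based check could look; the helper name and exact check are assumptions, not the PR's actual implementation:

```python
import inspect

from pyspark.sql.datasource import DataSourceStreamReader


def _supports_admission_control(reader: DataSourceStreamReader) -> bool:
    # Hypothetical helper: the new signature is latestOffset(self, start, limit), so a bound
    # method exposing fewer than two parameters is treated as the old, parameterless form.
    params = inspect.signature(reader.latestOffset).parameters
    return len(params) >= 2
```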
The `ReadLimit` interface and built-in implementations will be available for source implementations to leverage. The built-in implementations are: `ReadAllAvailable`, `ReadMinRows`, `ReadMaxRows`, `ReadMaxFiles`, `ReadMaxBytes`. We won't support custom implementations of the `ReadLimit` interface at this point, since that requires major effort and we don't see a demand, but we can plan for it if strong demand appears.
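For illustration, a hedged sketch of how a source could use the built-in limits; the import path, the `max_rows` field name, and the offset helper are assumptions rather than the PR's exact API:

```python
from pyspark.sql.datasource import DataSourceStreamReader, ReadLimit, ReadMaxRows


class MyStreamReader(DataSourceStreamReader):
    def initialOffset(self) -> dict:
        return {"offset": 0}

    def getDefaultReadLimit(self) -> ReadLimit:
        # Ask the engine to cap each micro-batch at roughly 1000 rows by default.
        return ReadMaxRows(max_rows=1000)  # field name is an assumption

    def latestOffset(self, start: dict, limit: ReadLimit) -> dict:
        available = self._highest_available_offset()
        if isinstance(limit, ReadMaxRows):
            # The engine-provided limit takes precedence over any source-configured option.
            end = min(available, start["offset"] + limit.max_rows)
        else:
            # e.g. ReadAllAvailable: read everything currently available,
            # ignoring any read limit configured through options.
            end = available
        # If end equals start["offset"], the returned offset matches the start parameter,
        # which signals that there is no more data to read.
        return {"offset": end}

    def _highest_available_offset(self) -> int:
        # Hypothetical: a real source would query the external system here.
        return 10_000

    def read(self, partition):
        return iter([])  # stub so the abstract method is implemented
```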
We do not make any change to `SimpleDataSourceStreamReader` for Admission Control, since it is designed for small data fetches and could be considered as already limiting the data. We could still add `ReadLimit` later if we see strong demand for limiting the fetch size via a source option.

To support Trigger.AvailableNow, we propose to introduce a new interface (a rough sketch follows).
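A sketch of what such a mix-in could look like, based on the `SupportsTriggerAvailableNow` / `prepareForTriggerAvailableNow` names used in the review discussion above; the exact definition in the PR may differ:

```python
from pyspark.sql.datasource import DataSourceStreamReader


class SupportsTriggerAvailableNow:
    def prepareForTriggerAvailableNow(self) -> None:
        """Called before the first micro-batch of a Trigger.AvailableNow run, so the
        source can snapshot the offsets it should read up to and never go beyond them."""
        ...


# Mixed in alongside the reader base class; the same would apply to SimpleDataSourceStreamReader.
class MyAvailableNowReader(DataSourceStreamReader, SupportsTriggerAvailableNow):
    def prepareForTriggerAvailableNow(self) -> None:
        self._end_offsets = {"offset": 100}  # hypothetical: snapshot the current latest offset

    def read(self, partition):
        return iter([])  # stub so the abstract method is implemented
```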
This interface can be mixed in with both `DataSourceStreamReader` and `SimpleDataSourceStreamReader`. It won't work with `DataSourceStreamReader` implementations having the old method signature of `latestOffset()`, as mentioned above.

Why are the changes needed?
This is to catch up with the features supported in the Scala DSv2 API; we have received reports from developers that the missing features block them from implementing some data sources.
Does this PR introduce any user-facing change?
Yes, users implementing a streaming reader via the Python data source API will be able to add support for Admission Control and Trigger.AvailableNow, which had been major missing features.
How was this patch tested?
New UTs.
Was this patch authored or co-authored using generative AI tooling?
Co-authored using claude-4.5-sonnet