Skip to content

Conversation

@Martozar
Copy link
Contributor

@Martozar Martozar commented Sep 2, 2025

Add optimized flag to DataFrameFactory to enable memory-optimized conversion of execution response to pandas dataframe. Without the flag, the conversion will run as usual, storing headers as a list of dictionaries. The optimized version only stores unique headers and references them, thereby preventing unnecessary memory allocations when processing large numbers of duplicated headers. On large datasets, optimized conversation consumes up to 10x less memory (e.g. on 1M rows with 4 attributes original implementation consumed almost 2Gb, while optimized not more than 200Mb)

Note that the new behaviour is optional and turned off by default, so no existing usages should be affected.

JIRA: CQ-1579
risk: low

LabelOverrides = dict[str, dict[str, dict[str, str]]]


class _Header(ABC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should also have __slots__ = () so as not to pollute class hierarchy.

see: https://stackoverflow.com/questions/1816483/how-does-inheritance-of-slots-in-subclasses-actually-work

Comment on lines 127 to 129
_unique_headers: list[_Header] = field(factory=list)
_header_to_index: dict[_Header, int] = field(factory=dict)
_indexes: list[int] = field(factory=list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if this isn't unnecessarily complicated? in the end, the goal is to reuse _Header instances right? why not having a mapping dict[_Header, _Header] and ditch the indexes altogether?

Comment on lines 127 to 129

_headers: list[_Header] = field(factory=list)
_header_to_header: dict[_Header, _Header] = field(factory=dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd rename this. _header_cache or _header_dedup or something like that. the header to header sounds like a name of some pop band :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed :)

Comment on lines 145 to 146
h2h = self._header_to_header
headers = self._headers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suggest to ditch these and use the self fields directly.

…o dataframe

Add `optimized` flag to DataFrameFactory to enable memory-optimized conversion of execution response to pandas dataframe. Without the flag, the conversion will run as usual, storing headers as a list of dictionaries. The optimized version only stores unique headers and reference them, preventing unnecessary memory allocations when lots of duplicated headers are processed. Note that the new behaviour is optional and turned off by default, so no existing usages should be affected.

JIRA: CQ-1579
risk: low
Copy link
Contributor

@lupko lupko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Martozar Martozar merged commit 17dfca8 into gooddata:master Sep 2, 2025
9 checks passed
@Martozar Martozar deleted the c.mze-CQ-1579 branch September 2, 2025 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants