-
Notifications
You must be signed in to change notification settings - Fork 60
feat: optimize memory allocation when converting execution response to dataframe #1125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| LabelOverrides = dict[str, dict[str, dict[str, str]]] | ||
|
|
||
|
|
||
| class _Header(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this should also have __slots__ = () so as not to pollute class hierarchy.
| _unique_headers: list[_Header] = field(factory=list) | ||
| _header_to_index: dict[_Header, int] = field(factory=dict) | ||
| _indexes: list[int] = field(factory=list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if this isn't unnecessarily complicated? in the end, the goal is to reuse _Header instances right? why not having a mapping dict[_Header, _Header] and ditch the indexes altogether?
c05773a to
fcceef4
Compare
|
|
||
| _headers: list[_Header] = field(factory=list) | ||
| _header_to_header: dict[_Header, _Header] = field(factory=dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd rename this. _header_cache or _header_dedup or something like that. the header to header sounds like a name of some pop band :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed :)
| h2h = self._header_to_header | ||
| headers = self._headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i suggest to ditch these and use the self fields directly.
fcceef4 to
5e3aaf7
Compare
…o dataframe Add `optimized` flag to DataFrameFactory to enable memory-optimized conversion of execution response to pandas dataframe. Without the flag, the conversion will run as usual, storing headers as a list of dictionaries. The optimized version only stores unique headers and reference them, preventing unnecessary memory allocations when lots of duplicated headers are processed. Note that the new behaviour is optional and turned off by default, so no existing usages should be affected. JIRA: CQ-1579 risk: low
5e3aaf7 to
ac0189f
Compare
lupko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Add
optimizedflag to DataFrameFactory to enable memory-optimized conversion of execution response to pandas dataframe. Without the flag, the conversion will run as usual, storing headers as a list of dictionaries. The optimized version only stores unique headers and references them, thereby preventing unnecessary memory allocations when processing large numbers of duplicated headers. On large datasets, optimized conversation consumes up to 10x less memory (e.g. on 1M rows with 4 attributes original implementation consumed almost 2Gb, while optimized not more than 200Mb)Note that the new behaviour is optional and turned off by default, so no existing usages should be affected.
JIRA: CQ-1579
risk: low