Skip to content

feat: add collections validation and GFQL support#874

Open
lmeyerov wants to merge 30 commits intomasterfrom
feat/collections-support
Open

feat: add collections validation and GFQL support#874
lmeyerov wants to merge 30 commits intomasterfrom
feat/collections-support

Conversation

@lmeyerov
Copy link
Contributor

@lmeyerov lmeyerov commented Dec 29, 2025

Summary

  • Add collections validator with strict/autofix behavior and GFQL wire normalization
  • Expose collections API with validate/warn and plot-time URL param validation
  • Add collections tests and clarify plan template location
  • Move collections helpers to graphistry/collections.py and introduce typed models in graphistry/models/collections.py

Key Features

  • g.collections(...) API for defining subsets via GFQL expressions with priority-based visual encodings
  • Helper constructors graphistry.collection_set(...) and graphistry.collection_intersection(...)
  • Support for showCollections, collectionsGlobalNodeColor, and collectionsGlobalEdgeColor URL params
  • Automatic JSON encoding and GFQL AST/Chain/wire-protocol normalization

Validation Behavior

  • Strict mode: Raises ValueError on invalid input
  • Autofix mode (default): Drops invalid collections with warnings
  • Set collections require an id field (server requirement) - missing IDs are warned/dropped, not auto-generated
  • Intersection collections cross-validate that referenced set IDs exist
  • GFQL parsing uses _wrap_gfql_expr as canonical implementation with precise exception handling

Testing

python -m pytest graphistry/tests/test_collections.py graphistry/tests/test_dataset_id_invalidation.py
./bin/lint.sh
./bin/mypy.sh

Review Comments Addressed

See comment for detailed breakdown of all review comments and their resolutions.

show_collections: Optional[bool] = None,
collections_global_node_color: Optional[str] = None,
collections_global_edge_color: Optional[str] = None,
encode: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the interest of simplifying, would it be difficult to detect pre-encoded strings?

Copy link
Contributor Author

@lmeyerov lmeyerov Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this surprised me: the reason it's like this is bc the base REST api takes AABBCC instead of ~css colors like we started to do elsewhere i believe (js color(xyz) normalizes to rgba('...'), handling '#AABBCC', 'silver', etc)

so maybe fix is to ship better colors in REST, and then this upgrades to that?

Copy link
Contributor

@mj3cheun mj3cheun Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think thats a great idea to ship better colours in REST will add that to the list

i was talking about the encode: bool = True, though, which is If True, JSON-minify and URL-encode collections. Use False for pre-encoded strings.

i was thinking we could just detect pre-encoded strings (which i assume are URL encoded and can be detected by attempting a parse). then we can remove this parameter

Comment on lines 70 to 73
if isinstance(collections, list):
return _coerce_collection_list(collections, validate_mode, warn)
if isinstance(collections, dict):
return _coerce_collection_list(collections, validate_mode, warn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: would just get rid of the _coerce_collection_list function and put the contents in here, would get rid of checking is list is dict etc twice and make easier to read

Comment on lines 321 to 322
if validate_mode == 'autofix':
collection_type = str(collection_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if collection_type was not already a string, im not sure that just typecasting it is going to produce a valid type string given this string must be either set or intersection

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually nevermind, just saw the if collection_type not in ('set', 'intersection'): below, im thinking maybe combine this check with the check below, if collection_type is not string its all but guaranteed to fail and hit continue in the section below

@mj3cheun
Copy link
Contributor

general comment, validate_mode = 'autofix' seems more likely to produce invalid output and/or result in a collection getting skipped than otherwise. to me its biggest value is typecasting stuff like id=1 to id="1" etc. not sure how much value there is in making it explicit, maybe we just have warn true/false and otherwise handle things silently?

@lmeyerov
Copy link
Contributor Author

hmm, if they do something like g.plot(validate_mode='autofix'), what do we want to happen , or not happen, w/ collections?

that was mostly added b/c people keep having dirty data that fails arrow conversion and rather seeing it load vs fixing their data/cfg

return raw
return raw

return _normalize_ops_list(_extract_ops_value(gfql_ops))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i feel this could be simplified by just feeding the GFQL into graphistry.compute.chain and seeing if it parses properly rather than setting up stuff here to determine if GFQL is valid or not

@mj3cheun
Copy link
Contributor

mj3cheun commented Jan 12, 2026

@lmeyerov regarding general comment above: im thinking maybe instead of validate we call it strict with "true" or "false" where strict true will throw errors and false will pass over non-compliant collections

autofix sorta implies to me that we are "correcting" the data when invalid but we arent really i think? more just passing it over if there are any mistakes

if we feel we need to type cast we might want to just do it anyway strict true or not

@lmeyerov
Copy link
Contributor Author

lmeyerov commented Jan 12, 2026

the issue with strict true/false is that's closer to what we had before, and users were complaining that they just wanted it to 'work', hence autofix (coerce++). validate=true/false is closer to what you're thinking, while 'autofix' (coerce) is "it'll run, but may not be what you want, but you said wanted soemthing that runs"

We therefore have leeway in what autofix does --- we just need to warn (if warn=auto/true), and do what. So the q is... what should it do wrt diff collections errors, if strong opinions about any params in any direction?

My default intuition is probably:

  • drop collections with invalid gfql
  • colors: random-but-deterministic? or neutral grey / transparent? (default black looks buggy)
  • others: ??? disable / see if there's default values ?

@mj3cheun
Copy link
Contributor

mj3cheun commented Jan 12, 2026

agree with that, i think we do the 3 bullets listed if strict (or whatever we want to call it) is false and its almost what we are doing right now. theres only 1 point i would make about the current implementation

  • instead of just trying to typecast stuff (which a lot of the time doesnt work), i would prefer we do default values or in the case of colours, random colours

in summary its only the approach i think might need to change, not the intention

EDIT: if we do the above it actually gets closer to a true autofix than what it was before, so i dont have an issue with the name anymore

@lmeyerov lmeyerov force-pushed the feat/collections-support branch from 3cf9e9d to 41a33c4 Compare January 16, 2026 17:48
Copy link
Contributor

@aucahuasi aucahuasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR, I left some comments and I think we need to solve this feedback before merge:
https://github.com/graphistry/pygraphistry/pull/874/changes#r2778233409

- Remove dead code after return in _normalize_intersection_expr
- Add empty sets list validation in _normalize_sets_list
- Remove pointless type coercion (str() won't produce valid types)
- Consolidate GFQL parsing: _normalize_gfql_ops calls _wrap_gfql_expr
- Add cross-validation for intersection set ID references
- Require id field for sets (server requires it, no auto-generation)
- Use precise exception handling (TypeError, ValueError, GFQLValidationError)
- Update tests to include required id fields

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lmeyerov
Copy link
Contributor Author

lmeyerov commented Feb 8, 2026

Review Comments Addressed (commits 23d1f00, d2bdf02, 47d87e8)

HIGH Priority - Fixed

Comment Author Issue Resolution
#2778239385 @aucahuasi Dead code after return (line 232) ✅ Removed unreachable code
#2778238588 @aucahuasi Validate len(sets_list) > 0 ✅ Added empty sets validation in _normalize_sets_list
#2778244203 @aucahuasi Cross-validate intersection set IDs ✅ Added _validate_intersection_references() post-loop validation
#2778233409 @aucahuasi Consolidate GFQL parsing with collections.py _normalize_gfql_ops now calls _wrap_gfql_expr from collections.py
#2778241945 @aucahuasi Validate set['id'] is required ✅ Added validation warning; drops in autofix mode (no auto-generation - user must provide meaningful IDs)
#2683481643 @mj3cheun Type string coercion pointless ✅ Removed str(collection_type) - now continues/fails since coercion won't produce valid types

MEDIUM Priority - Fixed

Comment Author Issue Resolution
#2683617536 @mj3cheun Use Chain.from_json for GFQL parsing ✅ Now uses _wrap_gfql_expr which handles Chain/AST conversion canonically
#2683627736 @mj3cheun Dead if statement (intersection type check) ✅ Actually NOT dead - validates expr.type vs collection.type mismatch; kept for safety

LOW Priority - Fixed

Comment Author Issue Resolution
#2683343387 @mj3cheun Redundant normalize_validation_params call Keep for API safety - ensures consistent param handling regardless of caller
#2683393920 @mj3cheun Inline _coerce_collection_list ✅ Inlined into _parse_collections_input

Still Open / Deferred

Comment Author Issue Status
#2683317737 @mj3cheun Auto-detect pre-encoded strings Deferred - encode param was removed; always canonicalizes now
#2669705397 @lmeyerov "back out" plot-time validation Kept - acts as safety net for bypassing .collections() method

Key Design Decisions

  1. No ID auto-generation: Server requires IDs but client should not generate arbitrary ones like set_0. Collections without IDs are dropped in autofix mode with a warning. Users must provide meaningful IDs.

  2. Precise exception handling: Changed except Exception to except (TypeError, ValueError, GFQLValidationError) to avoid masking bugs.

  3. Cross-validation for intersections: New _validate_intersection_references() ensures intersection sets reference actual collection IDs. Dangling references are caught before hitting the backend.

  4. GFQL parsing consolidation: _normalize_gfql_ops now delegates to _wrap_gfql_expr from collections.py as the canonical implementation.

  5. Proper type handling: Used dict(entry) to convert TypedDicts to plain dicts at runtime for uniform handling, avoiding unchecked static casts.


CI Status: All jobs green (manual workflow_dispatch run 21793451172) - python-lint-types 3.8-3.14 ✅, all test suites ✅

lmeyerov and others added 3 commits February 7, 2026 20:13
- Set collections require id field (no auto-generation)
- Intersection cross-validation for set ID references
- GFQL parsing consolidation with precise exception handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address review comment #2683393920 - removes unnecessary helper function.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CollectionsInput includes TypedDicts which need to be cast to Dict[str, Any].

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lmeyerov lmeyerov force-pushed the feat/collections-support branch 2 times, most recently from 9cc4506 to 47d87e8 Compare February 8, 2026 06:02
lmeyerov and others added 3 commits February 7, 2026 23:43
Server requires IDs for all collection types (used as storage keys).
In autofix mode, generate `{type}_{idx}` IDs when missing instead of
dropping the collection. This makes simple use cases "just work" while
still warning about the missing ID.

- Sets get `set_0`, `set_1`, etc.
- Intersections get `intersection_0`, `intersection_1`, etc.
- Strict mode still rejects missing IDs
- Added test for auto-generation behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change from `set_0` to `set-0` for consistency with other ID patterns
in the codebase (e.g., kepler's `dataset-{uuid[:8]}`).

User-provided IDs can be any string - no validation beyond type check.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add normalize_gfql_to_wire() as canonical GFQL→wire implementation
- Simplify collections.py (removed 36 lines of duplicate logic)
- Simplify validate_collections.py to call compute/ast
- Clean import graph: models → compute → helpers/validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Contributor

@aucahuasi aucahuasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmeyerov and others added 2 commits February 10, 2026 15:36
…intersections)

- Allow intersections to reference other intersections (backend supports this)
- Detect self-references (intersection referencing itself)
- Detect cycles in intersection dependencies (A->B->A)
- Add tests for DAG validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
AST class-level from_json methods (ASTLet, ASTRemoteGraph, ASTRef, ASTCall)
use bare `assert` for required field checks. These raise AssertionError
which was not caught by the validation boundary, causing raw crashes
instead of proper validation errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@lmeyerov
Copy link
Contributor Author

Thanks for the thorough reviews! All comments have been addressed:

Consolidation & Architecture:

  • Moved GFQL normalization to compute/ast.py:normalize_gfql_to_wire() as canonical implementation
  • Clean import graph: models → compute → helpers/validation
  • collections.py and validate_collections.py both call compute/ast

Validation Improvements:

  • Intersection DAG validation: supports intersections-of-intersections, detects self-references and cycles
  • Added AssertionError to exception handling (AST from_json methods use bare asserts)
  • Empty sets list validation
  • Cross-validation of intersection set ID references
  • Auto-generate kebab-case IDs in autofix mode (set-0, intersection-1)

Code Quality:

  • Removed dead code after return statements
  • Proper type handling with dict(entry) instead of unchecked casts
  • 22 tests covering all validation paths

CI is green, ready for merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants