Conversation
|
@aucahuasi - it is failing for test-full-ai, unsure how to fix that. Known issue? |
| return True | ||
|
|
||
|
|
||
| def validate_graph(g, verbose=True): |
There was a problem hiding this comment.
-
add signature to
Plottableinterface: https://github.com/graphistry/pygraphistry/blob/master/graphistry/Plottable.py -
expose for chained use in
PlotterBase, e.g., :pygraphistry/graphistry/PlotterBase.py
Line 1428 in 320326c
| def check_node_dataframe_exists(g, verbose=True): | ||
| if g._nodes is None: | ||
| if verbose: | ||
| print("Warning: graph was created with only edges. Skipping Node ID check if Node IDs match edge IDs. Use g2 = g.materialize_nodes() to force node df creation. Exiting.") |
There was a problem hiding this comment.
switch from print statements to warn + logger.info depending on use
removes need for verbose param via Pythonic conventions
|
|
||
|
|
||
| def validate_graph(g, verbose=True): | ||
| if not check_node_dataframe_exists(g, verbose): |
There was a problem hiding this comment.
node existence should be more like:
- either _node and _nodes are both defined or neither defined
| return True | ||
|
|
||
|
|
||
| def validate_graph(g, verbose=True): |
There was a problem hiding this comment.
for validators used for debugging, instead of doing an immediate error, good to collect errors and report them all, with failure only at the end
|
|
||
|
|
||
| def check_nan_node_ids(g, verbose=True): | ||
| if g._nodes[g._node].isnull().any(): |
There was a problem hiding this comment.
compute & report .sum() vs just .any()
a more verbose mode might return the df to help save time
|
|
||
|
|
||
| def check_duplicate_node_ids(g, verbose=True): | ||
| if g._nodes[g._node].duplicated().any(): |
There was a problem hiding this comment.
|
|
||
|
|
||
| def check_edge_sources_exist_in_nodes(g, verbose=True): | ||
| if not g._edges[g._source].isin(g._nodes[g._node]).all(): |
There was a problem hiding this comment.
|
|
||
|
|
||
| def check_edge_destinations_exist_in_nodes(g, verbose=True): | ||
| if not g._edges[g._destination].isin(g._nodes[g._node]).all(): |
There was a problem hiding this comment.
| if not check_duplicate_node_ids(g, verbose): | ||
| return False | ||
| check_edge_sources_exist_in_nodes(g, verbose) # Warnings only | ||
| check_edge_destinations_exist_in_nodes(g, verbose) # Warnings only |
There was a problem hiding this comment.
why not returning False?
There was a problem hiding this comment.
Good point, I guess I was considering only graph with edge IDs that could be valid.
However, to your point, we should error if g._node and g._nodes exist and there are missing node IDs from g._edges src/dst, that should throw an error.
| return True | ||
|
|
||
|
|
||
| def validate_graph(g, verbose=True): |
There was a problem hiding this comment.
as a public function, should have a docstr matching other docstrs format
There was a problem hiding this comment.
see comments
bigger picture:
-
always-on for validated uploads: factor out fast metadata-only checks (e.g.,
validate_schema_bindingsbinding name <> col name match up) from data-level checks (e.g.,validate_indexeschecking col for nan, value mismatches, etc): updateArrowUploader::create_dataset->if validate: ...to include the fast metadata-only checks. Arguably we should do data-level checks here too, but that changes perf characteristics more fundamentally so let's not for now.. -
expose on
Plotter/PlotterBase, ideally calling both this +validate_encodings -
switch to
logging/warnvsverbse => print(), and report error count (# hits per violated condition) and ideally all errors vs early-exist on first -
add an ipynb demo
-
some potential items to warn or error on
| return True | ||
|
|
||
|
|
||
| def validate_graph(g, verbose=True): |
There was a problem hiding this comment.
recommend also checking that dtypes of src/dst match, as does node/src if exist
type mismatch is more of a warning vs a fail, as we can support bipartite graphs, like imagine user:str <> day:int
There was a problem hiding this comment.
likewise, as a warn vs fail, if there is no overlap of src/dst edges, they may have a mismatch needing cleanup, imagine 123 <> id:123
Added a graph validator.
Checks: