| 1 | +# Test Coverage Analysis for DataFusion Python |
| 2 | + |
| 3 | +This document analyzes the current test coverage of the datafusion-python codebase and identifies areas that need improvement. |
| 4 | + |
| 5 | +## Executive Summary |
| 6 | + |
| 7 | +The datafusion-python project has approximately **192 test functions** across **20 test files** (~5,935 lines of test code). Core functionality is reasonably well tested, but there are significant gaps, most notably in aggregation functions, window functions, object stores, and error handling. |
| 8 | + |
| 9 | +### Key Findings |
| 10 | + |
| 11 | +| Area | Current Coverage | Priority | |
| 12 | +|------|-----------------|----------| |
| 13 | +| Math Functions | ~97% | Low (well covered) | |
| 14 | +| String Functions | ~84% | Medium | |
| 15 | +| Array/List Functions | ~100% | Low (well covered) | |
| 16 | +| Hash Functions | ~100% | Low (well covered) | |
| 17 | +| **Aggregation Functions** | **~29%** | **CRITICAL** | |
| 18 | +| **Window Functions** | **~0%** | **CRITICAL** | |
| 19 | +| User-Defined Functions | ~40% | High | |
| 20 | +| Object Store | ~5% | High | |
| 21 | +| Common Types | ~5% | Medium | |
| 22 | +| Error Handling | ~5% | High | |
| 23 | + |
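The table does not state how the percentages were measured. One rough way to approximate a per-module figure is a name-based check: collect the public functions a module defines and see which names are ever referenced in the test code. The sketch below takes that approach; all three helper names are hypothetical, and a referenced name only shows that a function is mentioned in a test, not that its behavior is verified.

```python
import ast


def public_functions(module_source: str) -> set[str]:
    """Names of top-level functions that do not start with an underscore."""
    tree = ast.parse(module_source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_")
    }


def referenced_names(test_source: str) -> set[str]:
    """Every bare identifier or attribute name that appears in the test code."""
    names = set()
    for node in ast.walk(ast.parse(test_source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)
    return names


def name_coverage(module_source: str, test_source: str) -> tuple[float, list[str]]:
    """Return (percent of functions referenced, sorted unreferenced names)."""
    funcs = public_functions(module_source)
    missing = sorted(funcs - referenced_names(test_source))
    if not funcs:
        return 100.0, missing
    return 100.0 * (len(funcs) - len(missing)) / len(funcs), missing
```

Feeding it the text of `python/datafusion/functions.py` and the concatenated test sources would yield a list of never-referenced functions. This is an approximation, not line or branch coverage.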
| 24 | +--- |
| 25 | + |
| 26 | +## Critical Coverage Gaps |
| 27 | + |
| 28 | +### 1. Aggregation Functions (CRITICAL) |
| 29 | + |
| 30 | +**Location**: `python/datafusion/functions.py` |
| 31 | +**Test File**: `python/tests/test_functions.py` |
| 32 | + |
| 33 | +The basic SQL aggregation functions have almost no dedicated tests: |
| 34 | + |
| 35 | +**NOT TESTED (28 functions listed):** |
| 36 | +- **Basic Aggregates**: `avg()`, `count()`, `count_star()`, `max()`, `min()`, `sum()` |
| 37 | +- **Statistical**: `corr()`, `covar()`, `covar_pop()`, `covar_samp()`, `median()`, `stddev()`, `stddev_pop()`, `stddev_samp()`, `var()`, `var_pop()`, `var_samp()`, `var_sample()` |
| 38 | +- **Bitwise and Boolean**: `bit_and()`, `bit_or()`, `bit_xor()`, `bool_and()`, `bool_or()` |
| 39 | +- **Approximate**: `approx_distinct()`, `approx_median()`, `approx_percentile_cont()`, `approx_percentile_cont_with_weight()` |
| 40 | +- **String**: `string_agg()` |
| 41 | + |
| 42 | +**Recommendation**: Create `python/tests/test_aggregation_functions.py` with comprehensive tests for all aggregation functions. |
| 43 | + |
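Tests for the statistical aggregates need trustworthy expected values, and the population and sample variants are easy to mix up. A minimal sketch, assuming the usual convention that `*_pop` divides by n and `*_samp` divides by n - 1, computes reference values with Python's `statistics` module. The helper `expected_aggregates` is hypothetical and does not call DataFusion; its keys simply mirror the function names listed above.

```python
import statistics


def expected_aggregates(values):
    """Reference results for a non-null numeric column with at least two rows."""
    return {
        "avg": statistics.fmean(values),
        "median": statistics.median(values),
        "var_pop": statistics.pvariance(values),    # divides by n
        "var_samp": statistics.variance(values),    # divides by n - 1
        "stddev_pop": statistics.pstdev(values),
        "stddev_samp": statistics.stdev(values),
    }


expected = expected_aggregates([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

A test could then compare each DataFusion result against the matching entry with a floating-point tolerance rather than exact equality. Null handling and empty inputs are outside this sketch and would need their own cases.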
| 44 | +### 2. Window Functions (CRITICAL) |
| 45 | + |
| 46 | +**Location**: `python/datafusion/functions.py` |
| 47 | +**Test File**: None dedicated |
| 48 | + |
| 49 | +**ZERO COVERAGE for these functions:** |
| 50 | +- `row_number()` |
| 51 | +- `rank()` |
| 52 | +- `dense_rank()` |
| 53 | +- `lag()` |
| 54 | +- `lead()` |
| 55 | +- `first_value()` |
| 56 | +- `last_value()` |
| 57 | +- `nth_value()` |
| 58 | +- `cume_dist()` |
| 59 | +- `percent_rank()` |
| 60 | +- `ntile()` |
| 61 | + |
| 62 | +**Recommendation**: Create `python/tests/test_window_functions.py` with tests covering: |
| 63 | +- Basic window function usage |
| 64 | +- PARTITION BY clauses |
| 65 | +- ORDER BY clauses |
| 66 | +- Window frame specifications (ROWS, RANGE) |
| 67 | +- Combinations of partition/order/frame |
| 68 | + |
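For the ranking functions, it helps to pin down the expected semantics before writing assertions. The sketch below is a plain-Python reference for `row_number()`, `rank()`, and `dense_rank()` over a single partition ordered ascending. The helper `reference_ranks` is hypothetical, it does not call DataFusion, and it assumes no null values.

```python
def reference_ranks(values):
    """Return (row_number, rank, dense_rank) lists aligned with the input order."""
    # Python's sort is stable, so tied rows keep their input order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    row_number = [0] * len(values)
    rank = [0] * len(values)
    dense_rank = [0] * len(values)
    previous = object()  # sentinel that never equals a real value
    current_rank = 0
    current_dense = 0
    for position, i in enumerate(order, start=1):
        if values[i] != previous:
            current_rank = position   # rank leaves gaps after ties
            current_dense += 1        # dense_rank does not
            previous = values[i]
        row_number[i] = position
        rank[i] = current_rank
        dense_rank[i] = current_dense
    return row_number, rank, dense_rank
```

For the input `[30, 10, 20, 20]` this gives row numbers `[4, 1, 2, 3]`, ranks `[4, 1, 2, 2]`, and dense ranks `[3, 1, 2, 2]`. Since row numbering among tied rows is generally not guaranteed by SQL engines, tests for `row_number()` are more robust when the ordering key is unique.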
| 69 | +--- |
| 70 | + |
| 71 | +## High Priority Coverage Gaps |
| 72 | + |
| 73 | +### 3. Object Store Module |
| 74 | + |
| 75 | +**Location**: `python/datafusion/object_store.py` |
| 76 | +**Test File**: `python/tests/test_store.py` (minimal) |
| 77 | + |
| 78 | +| Object Store | Test Coverage | |
| 79 | +|--------------|---------------| |
| 80 | +| `LocalFileSystem` | NOT TESTED | |
| 81 | +| `AmazonS3` | NOT TESTED | |
| 82 | +| `GoogleCloud` | NOT TESTED | |
| 83 | +| `MicrosoftAzure` | NOT TESTED | |
| 84 | +| `Http` | Minimal (1 test) | |
| 85 | + |
| 86 | +**Recommendation**: Add tests for: |
| 87 | +- Object store instantiation with various configurations |
| 88 | +- Registration with SessionContext |
| 89 | +- Reading files from each store type |
| 90 | +- Error handling for invalid credentials/endpoints |
| 91 | + |
| 92 | +### 4. User-Defined Functions (UDF/UDAF/UDWF) |
| 93 | + |
| 94 | +**Location**: `python/datafusion/udf.py` |
| 95 | +**Test Files**: `test_udf.py`, `test_udaf.py`, `test_udwf.py` |
| 96 | + |
| 97 | +#### Missing Coverage: |
| 98 | + |
| 99 | +| Feature | ScalarUDF | UDAF | UDWF | |
| 100 | +|---------|:---------:|:----:|:----:| |
| 101 | +| Multi-argument functions | ❌ | ❌ | ✅ | |
| 102 | +| Null array handling | ❌ | ❌ | ❌ | |
| 103 | +| Empty array handling | ❌ | ❌ | ❌ | |
| 104 | +| Volatility=Stable | ❌ | ❌ | ❌ | |
| 105 | +| Volatility=Volatile | ❌ | ❌ | ❌ | |
| 106 | +| Enum volatility values | ❌ | ❌ | ❌ | |
| 107 | +| Multiple return types | ❌ | ❌ | ❌ | |
| 108 | +| Name auto-generation | ❌ | ❌ | ⚠️ | |
| 109 | +| Error propagation | ❌ | ❌ | ⚠️ | |
| 110 | + |
| 111 | +**Legend**: ✅ Tested, ⚠️ Partially tested, ❌ Not tested |
| 112 | + |
| 113 | +**Specific Gaps**: |
| 114 | + |
| 115 | +**ScalarUDF (`test_udf.py`):** |
| 116 | +- Only tests `pa.bool_()` return type - missing int, float, string, complex types |
| 117 | +- No multi-argument UDF tests |
| 118 | +- No volatility variations tested |
| 119 | +- No error handling tests |
| 120 | + |
| 121 | +**UDAF (`test_udaf.py`):** |
| 122 | +- Only tests `pa.float64()` return type |
| 123 | +- Known bug: code breaks on None (line 37 comment) - not tested |
| 124 | +- No multi-state accumulator tests |
| 125 | +- Limited merge operation testing |
| 126 | + |
| 127 | +**UDWF (`test_udwf.py`):** |
| 128 | +- Missing tests for `memoize()`, `is_causal()` methods |
| 129 | +- Only 2 window frame types tested |
| 130 | +- No tests for flag combinations |
| 131 | + |
| 132 | +### 5. Error Handling |
| 133 | + |
| 134 | +**Current State**: Only ~24 error handling tests across the entire test suite |
| 135 | + |
| 136 | +**Missing Error Cases**: |
| 137 | +- Invalid SQL queries (malformed syntax) |
| 138 | +- Schema mismatch errors |
| 139 | +- Type coercion failures |
| 140 | +- Resource exhaustion scenarios |
| 141 | +- Invalid configuration options |
| 142 | +- Invalid UDF return types |
| 143 | +- Stream operation errors |
| 144 | +- File not found scenarios |
| 145 | +- Permission errors |
| 146 | + |
| 147 | +**Recommendation**: Add error handling tests to each module's test file. |
| 148 | + |
| 149 | +--- |
| 150 | + |
| 151 | +## Medium Priority Coverage Gaps |
| 152 | + |
| 153 | +### 6. SessionContext Methods |
| 154 | + |
| 155 | +**Location**: `python/datafusion/context.py` |
| 156 | +**Test File**: `python/tests/test_context.py` |
| 157 | + |
| 158 | +**Untested Methods**: |
| 159 | +- `session_id()` - Returns unique session identifier |
| 160 | +- `empty_table()` - Creates empty DataFrame |
| 161 | +- `enable_url_table()` - Enables querying local files as tables |
| 162 | +- `register_table_provider()` - Advanced table provider registration |
| 163 | + |
| 164 | +**SessionConfig Untested Methods**: |
| 165 | +- `with_batch_size()` |
| 166 | +- `with_repartition_sorts()` |
| 167 | +- `with_repartition_file_scans()` |
| 168 | +- `with_repartition_file_min_size()` |
| 169 | + |
| 170 | +**RuntimeEnvBuilder Untested Methods**: |
| 171 | +- `with_disk_manager_disabled()` |
| 172 | +- `with_unbounded_memory_pool()` |
| 173 | +- `with_greedy_memory_pool()` |
| 174 | + |
| 175 | +### 7. Expression Methods |
| 176 | + |
| 177 | +**Location**: `python/datafusion/expr.py` |
| 178 | +**Test File**: `python/tests/test_expr.py` |
| 179 | + |
| 180 | +**Untested Methods**: |
| 181 | +- `canonical_name()` - Complete string representation |
| 182 | +- `variant_name()` - Returns Expr variant name |
| 183 | +- `rex_type()` - Returns RexType classification |
| 184 | +- `types()` - Returns DataTypeMap |
| 185 | +- `python_value()` - Extracts value from literal |
| 186 | +- `rex_call_operands()` - Returns operands |
| 187 | +- `rex_call_operator()` - Extracts operator |
| 188 | +- `column_name()` - Compute output column name |
| 189 | + |
| 190 | +### 8. Plan Methods |
| 191 | + |
| 192 | +**Location**: `python/datafusion/plan.py` |
| 193 | +**Test File**: `python/tests/test_plans.py` |
| 194 | + |
| 195 | +**LogicalPlan Untested Methods**: |
| 196 | +- `display_indent_schema()` - Print indented schema |
| 197 | +- `display_graphviz()` - GraphViz visualization |
| 198 | + |
| 199 | +**ExecutionPlan Untested Methods**: |
| 200 | +- `children()` - Get list of child plans |
| 201 | +- `display_indent()` - Indented physical plan display |
| 202 | + |
| 203 | +### 9. Common Module Types |
| 204 | + |
| 205 | +**Location**: `python/datafusion/common.py` |
| 206 | +**Test File**: None dedicated |
| 207 | + |
| 208 | +**Untested Types**: |
| 209 | +- `DFSchema` - No functional tests |
| 210 | +- `DataType` - No tests |
| 211 | +- `DataTypeMap` - No tests |
| 212 | +- `PythonType` - No tests |
| 213 | +- `RexType` - No tests |
| 214 | +- `SqlFunction`, `SqlSchema`, `SqlStatistics`, `SqlTable`, `SqlType`, `SqlView` - No tests |
| 215 | + |
| 216 | +### 10. Temporal Functions |
| 217 | + |
| 218 | +**Location**: `python/datafusion/functions.py` |
| 219 | + |
| 220 | +**Untested (6 functions)**: |
| 221 | +- `current_date()` - Returns current UTC date |
| 222 | +- `current_time()` - Returns current UTC time |
| 223 | +- `make_date()` - Construct date from year, month, day |
| 224 | +- `now()` - Returns current timestamp |
| 225 | +- `to_unixtime()` - Convert to Unix time |
| 226 | +- `to_hex()` - Integer to hex string |
| 227 | + |
| 228 | +### 11. String Functions |
| 229 | + |
| 230 | +**Location**: `python/datafusion/functions.py` |
| 231 | + |
| 232 | +**Untested (7 functions)**: |
| 233 | +- `char_length()` - Alias for length |
| 234 | +- `find_in_set()` - Find string in comma-separated list |
| 235 | +- `instr()` - Alias for strpos |
| 236 | +- `levenshtein()` - Edit distance calculation |
| 237 | +- `position()` - Alias for strpos |
| 238 | +- `substr_index()` - Substring before N occurrences |
| 239 | +- `substring()` - With explicit position and length |
| 240 | + |
| 241 | +--- |
| 242 | + |
| 243 | +## Edge Cases Requiring Tests |
| 244 | + |
| 245 | +### Empty DataFrames |
| 246 | +- Current: Partially tested for `to_pandas()`, `to_polars()`, `to_arrow_table()` |
| 247 | +- Missing: Empty record batch streams, empty aggregation results |
| 248 | + |
| 249 | +### Null Value Handling |
| 250 | +- Current: Null values appear in various tests, but there is no dedicated null-handling coverage |
| 251 | +- Missing: All-null batches, null in multi-argument functions, null handling in UDFs |
| 252 | + |
| 253 | +### Type Coercion |
| 254 | +- Current: Basic `cast()` test exists |
| 255 | +- Missing: Invalid cast operations, implicit type conversions, cross-type comparisons |
| 256 | + |
| 257 | +### Large Datasets |
| 258 | +- Current: No performance tests |
| 259 | +- Missing: Tests with millions of rows, memory efficiency tests, large batch handling |
| 260 | + |
| 261 | +--- |
| 262 | + |
| 263 | +## Recommended New Test Files |
| 264 | + |
| 265 | +1. **`test_aggregation_functions.py`** - All aggregation functions (~22 tests) |
| 266 | +2. **`test_window_functions.py`** - All window functions (~15 tests) |
| 267 | +3. **`test_object_store.py`** (expand) - All object store types (~20 tests) |
| 268 | +4. **`test_expression_builders.py`** - coalesce, nullif, in_list, struct (~15 tests) |
| 269 | +5. **`test_common_types.py`** - DFSchema, DataType, etc. (~15 tests) |
| 270 | +6. **`test_error_handling.py`** - Cross-module error cases (~30 tests) |
| 271 | +7. **`test_edge_cases.py`** - Empty, null, large datasets (~25 tests) |
| 272 | + |
| 273 | +--- |
| 274 | + |
| 275 | +## Summary of Test Improvements Needed |
| 276 | + |
| 277 | +| Priority | Area | Estimated Tests Needed | |
| 278 | +|----------|------|----------------------| |
| 279 | +| CRITICAL | Aggregation Functions | 25+ | |
| 280 | +| CRITICAL | Window Functions | 20+ | |
| 281 | +| HIGH | Object Store | 20+ | |
| 282 | +| HIGH | UDF/UDAF/UDWF Gaps | 30+ | |
| 283 | +| HIGH | Error Handling | 30+ | |
| 284 | +| MEDIUM | SessionContext Methods | 15+ | |
| 285 | +| MEDIUM | Expression Methods | 10+ | |
| 286 | +| MEDIUM | Common Types | 15+ | |
| 287 | +| MEDIUM | Temporal Functions | 10+ | |
| 288 | +| LOW | String Function Aliases | 5+ | |
| 289 | +| LOW | Edge Cases | 25+ | |
| 290 | +| **TOTAL** | | **~200+ tests** | |
| 291 | + |
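As an arithmetic check on the summary table, the short sketch below sums the lower-bound estimates per priority. The figures are copied from the table; the names `ROWS` and `totals_by_priority` are introduced here only for illustration.

```python
# (priority, area, lower-bound estimate) rows copied from the summary table.
ROWS = [
    ("CRITICAL", "Aggregation Functions", 25),
    ("CRITICAL", "Window Functions", 20),
    ("HIGH", "Object Store", 20),
    ("HIGH", "UDF/UDAF/UDWF Gaps", 30),
    ("HIGH", "Error Handling", 30),
    ("MEDIUM", "SessionContext Methods", 15),
    ("MEDIUM", "Expression Methods", 10),
    ("MEDIUM", "Common Types", 15),
    ("MEDIUM", "Temporal Functions", 10),
    ("LOW", "String Function Aliases", 5),
    ("LOW", "Edge Cases", 25),
]


def totals_by_priority(rows):
    """Sum the estimates for each priority level."""
    totals = {}
    for priority, _area, estimate in rows:
        totals[priority] = totals.get(priority, 0) + estimate
    return totals


by_priority = totals_by_priority(ROWS)
grand_total = sum(by_priority.values())
print(by_priority)   # {'CRITICAL': 45, 'HIGH': 80, 'MEDIUM': 50, 'LOW': 30}
print(grand_total)   # 205
```

The lower bounds add up to 205, which is consistent with the "~200+" total, and the two CRITICAL areas account for 45 of those tests.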
| 292 | +--- |
| 293 | + |
| 294 | +## Quick Wins |
| 295 | + |
| 296 | +These tests can be added with minimal effort: |
| 297 | + |
| 298 | +1. **Aggregation basics**: Add tests for `sum()`, `count()`, `avg()`, `min()`, `max()` - each needs only a short, single-assertion test |
| 299 | +2. **Window function basics**: Add tests for `row_number()`, `rank()`, `dense_rank()` |
| 300 | +3. **SessionContext.session_id()**: Simple method call test |
| 301 | +4. **UDF with multiple arguments**: Extend existing test patterns |
| 302 | +5. **Volatility enum values**: Add parametrized tests to existing UDF tests |
| 303 | + |
| 304 | +--- |
| 305 | + |
| 306 | +## Conclusion |
| 307 | + |
| 308 | +While the datafusion-python test suite provides good coverage for many core features, there are critical gaps in: |
| 309 | + |
| 310 | +1. **SQL aggregation functions** - The most commonly used SQL operations |
| 311 | +2. **Window functions** - Entire category with zero coverage |
| 312 | +3. **Object stores** - Critical for cloud deployments |
| 313 | +4. **Error handling** - Essential for production reliability |
| 314 | + |
| 315 | +Addressing these gaps would significantly improve the reliability and maintainability of the project. |