-
-
Notifications
You must be signed in to change notification settings - Fork 159
Add nixpkgs license clarity analysis - Fixes #1940 #2033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add nixpkgs license clarity analysis - Fixes #1940 #2033
Conversation
Implement automated license issue detection and reporting for Nixpkgs packages. Features: - Automatically detect incorrect license declarations - Determine correct licenses with confidence scores - Apply nixpkgs-specific rules for diverse tech stacks - Flag packages and detections needing review - Generate comprehensive license reports Implementation: - scanpipe/pipes/nixpkgs.py: Core license analysis utilities (13 functions) - scanpipe/pipelines/analyze_nixpkgs_licenses.py: Automated pipeline - scanpipe/tests/test_nixpkgs.py: Comprehensive test suite (16 tests) Detection capabilities: - Missing/invalid license declarations - License mismatches (declared vs detected) - Ecosystem-specific license validation (Python, Rust, Node.js, Go, etc.) - Copyleft compliance issues - License file inconsistencies - Ambiguous detections requiring review Output stored in project.extra_data with severity levels (error/warning/info) and automated suggestions for corrections. Addresses parent issue aboutcode-org#1939 Signed-off-by: Sahil Lenka <sahillenka44@gmail.com>
ec2a7fd to
23f3a90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request implements automated license issue detection and reporting for Nixpkgs packages. The implementation adds a new pipeline (analyze_nixpkgs_licenses.py), core utility functions (pipes/nixpkgs.py), and a comprehensive test suite to detect incorrect license declarations, determine correct licenses with confidence scores, and apply nixpkgs-specific rules for diverse tech stacks.
Changes:
- New pipeline for automated Nixpkgs license analysis with 6 custom steps
- 13 utility functions in nixpkgs.py for license detection, validation, and reporting
- 16 test cases covering the core functionality
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| scanpipe/tests/test_nixpkgs.py | Comprehensive test suite for nixpkgs license analysis utilities with 16 test cases |
| scanpipe/pipes/nixpkgs.py | Core license analysis utilities including detection, validation, ecosystem checking, and reporting functions |
| scanpipe/pipelines/analyze_nixpkgs_licenses.py | Automated pipeline implementing license scanning, issue detection, flagging, and report generation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
scanpipe/pipes/nixpkgs.py
Outdated
| if any(lic in name_lower for lic in ["license", "licence", "copying", "copyright"]): | ||
| license_files.append(resource) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'name_lower' variable is assigned but the check uses multiple substrings ('license', 'licence', 'copying', 'copyright'). This will match many false positives like 'copyright_notice.txt' which may not be a license file. The logic on line 522 appears to use the same broad matching. Consider more precise matching patterns or using the resource.is_legal attribute.
| if any(lic in name_lower for lic in ["license", "licence", "copying", "copyright"]): | |
| license_files.append(resource) | |
| # Prefer using an explicit legal-file flag when available to avoid false positives. | |
| is_legal = getattr(resource, "is_legal", False) | |
| if is_legal: | |
| license_files.append(resource) | |
| continue | |
| # Fall back to matching common canonical license filenames by base name only. | |
| _base_name = name_lower.split(".", 1)[0] | |
| if _base_name in {"license", "licence", "copying", "copyright"}: | |
| license_files.append(resource) |
| NIXPKGS_ECOSYSTEM_LICENSE_PATTERNS = { | ||
| "python": ["MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-or-later"], | ||
| "rust": ["MIT", "Apache-2.0", "MIT OR Apache-2.0"], | ||
| "nodejs": ["MIT", "ISC", "BSD-3-Clause"], | ||
| "go": ["MIT", "Apache-2.0", "BSD-3-Clause"], | ||
| "haskell": ["BSD-3-Clause", "MIT"], | ||
| "ruby": ["MIT", "GPL-2.0-or-later"], | ||
| "perl": ["Artistic-2.0", "GPL-1.0-or-later"], | ||
| "java": ["Apache-2.0", "MIT", "LGPL-2.1-or-later"], | ||
| } |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NIXPKGS_ECOSYSTEM_LICENSE_PATTERNS dictionary maps package types to expected licenses, but the package.type field may not always match these keys exactly. For example, a package type might be 'pypi' (as used in tests) but the dictionary key is 'python'. This mismatch will cause check_nixpkgs_ecosystem_license to always return None for PyPI packages. Either update the dictionary keys to match actual package types or implement a mapping between package types and ecosystem names.
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # http://nexb.com and https://github.com/aboutcode-org/scancode.io | ||
| # The ScanCode.io software is licensed under the Apache License version 2.0. | ||
| # Data generated with ScanCode.io is provided as-is without warranties. | ||
| # ScanCode is a trademark of nexB Inc. | ||
| # | ||
| # You may not use this software except in compliance with the License. | ||
| # You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0 | ||
| # Unless required by applicable law or agreed to in writing, software distributed | ||
| # under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR | ||
| # CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations under the License. | ||
| # | ||
| # Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES | ||
| # OR CONDITIONS OF ANY KIND, either express or implied. No content created from | ||
| # ScanCode.io should be considered or used as legal advice. Consult an Attorney | ||
| # for any legal advice. | ||
| # | ||
| # ScanCode.io is a free software code scanning tool from nexB Inc. and others. | ||
| # Visit https://github.com/aboutcode-org/scancode.io for support and download. | ||
|
|
||
| from django.test import TestCase | ||
|
|
||
| from scanpipe.models import CodebaseResource | ||
| from scanpipe.models import DiscoveredPackage | ||
| from scanpipe.models import Project | ||
| from scanpipe.pipes import nixpkgs | ||
|
|
||
|
|
||
| class NixpkgsLicenseAnalysisTest(TestCase): | ||
| def setUp(self): | ||
| self.project = Project.objects.create(name="Test Nixpkgs Project") | ||
|
|
||
| def test_detect_missing_declared_license(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| ) | ||
|
|
||
| issues = nixpkgs.detect_package_license_issues(package) | ||
|
|
||
| self.assertEqual(len(issues), 1) | ||
| self.assertEqual(issues[0]["type"], "missing_declared_license") | ||
| self.assertEqual(issues[0]["severity"], "error") | ||
|
|
||
| def test_detect_license_mismatch(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| # Create a resource with different detected license | ||
| resource = CodebaseResource.objects.create( | ||
| project=self.project, | ||
| path="test.py", | ||
| detected_license_expression="GPL-3.0-only", | ||
| ) | ||
| package.codebase_resources.add(resource) | ||
|
|
||
| issues = nixpkgs.detect_package_license_issues(package) | ||
|
|
||
| # Should detect mismatch | ||
| mismatch_issues = [i for i in issues if i["type"] == "license_mismatch"] | ||
| self.assertTrue(len(mismatch_issues) > 0) | ||
| self.assertEqual(mismatch_issues[0]["severity"], "warning") | ||
|
|
||
| def test_normalize_nixpkgs_license(self): | ||
| # Test common nixpkgs license mappings | ||
| self.assertEqual( | ||
| nixpkgs.normalize_nixpkgs_license("gpl2+"), | ||
| "GPL-2.0-or-later" | ||
| ) | ||
| self.assertEqual( | ||
| nixpkgs.normalize_nixpkgs_license("apache2"), | ||
| "Apache-2.0" | ||
| ) | ||
| self.assertEqual( | ||
| nixpkgs.normalize_nixpkgs_license("mit"), | ||
| "MIT" | ||
| ) | ||
|
|
||
| # Test unknown license passes through | ||
| self.assertEqual( | ||
| nixpkgs.normalize_nixpkgs_license("CustomLicense"), | ||
| "CustomLicense" | ||
| ) | ||
|
|
||
| def test_check_nixpkgs_ecosystem_license(self): | ||
| # Python package with unusual license | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="Artistic-2.0", | ||
| ) | ||
|
|
||
| issue = nixpkgs.check_nixpkgs_ecosystem_license(package) | ||
|
|
||
| self.assertIsNotNone(issue) | ||
| self.assertEqual(issue["type"], "unusual_ecosystem_license") | ||
| self.assertEqual(issue["severity"], "info") | ||
|
|
||
| def test_check_nixpkgs_ecosystem_license_typical(self): | ||
| # Python package with typical license | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| issue = nixpkgs.check_nixpkgs_ecosystem_license(package) | ||
|
|
||
| # No issue expected for typical license | ||
| self.assertIsNone(issue) | ||
|
|
||
| def test_detect_copyleft_with_dependencies(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="GPL-3.0-only", | ||
| ) | ||
|
|
||
| issues = nixpkgs.detect_copyleft_compliance_issues(package) | ||
|
|
||
| # Note: This test would need dependencies to trigger the issue | ||
| # For now, it should return empty list | ||
| self.assertIsInstance(issues, list) | ||
|
|
||
| def test_suggest_license_correction_single_detected(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="GPL-3.0-only", | ||
| ) | ||
|
|
||
| detected_licenses = ["MIT"] | ||
| suggestion = nixpkgs.suggest_license_correction(package, detected_licenses) | ||
|
|
||
| self.assertIsNotNone(suggestion) | ||
| self.assertEqual(suggestion["suggested_license"], "MIT") | ||
| self.assertEqual(suggestion["confidence"], "high") | ||
|
|
||
| def test_suggest_license_correction_multiple_detected(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="GPL-3.0-only", | ||
| ) | ||
|
|
||
| detected_licenses = ["MIT", "Apache-2.0"] | ||
| suggestion = nixpkgs.suggest_license_correction(package, detected_licenses) | ||
|
|
||
| self.assertIsNotNone(suggestion) | ||
| self.assertIn("OR", suggestion["suggested_license"]) | ||
| self.assertEqual(suggestion["confidence"], "medium") | ||
|
|
||
| def test_suggest_license_correction_no_detected(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| suggestion = nixpkgs.suggest_license_correction(package, []) | ||
|
|
||
| self.assertIsNone(suggestion) | ||
|
|
||
| def test_analyze_license_issues(self): | ||
| # Create package with issue | ||
| DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="package-with-issue", | ||
| version="1.0.0", | ||
| # No declared license - should trigger issue | ||
| ) | ||
|
|
||
| # Create package without issue | ||
| DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="package-without-issue", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| issues = nixpkgs.analyze_license_issues(self.project) | ||
|
|
||
| # Should have issues for one package | ||
| self.assertEqual(len(issues), 1) | ||
|
|
||
| def test_generate_license_report(self): | ||
| # Create package with issue | ||
| DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="package1", | ||
| version="1.0.0", | ||
| ) | ||
|
|
||
| # Create package without issue | ||
| DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="package2", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| report = nixpkgs.generate_license_report(self.project) | ||
|
|
||
| self.assertIn("summary", report) | ||
| self.assertIn("by_severity", report) | ||
| self.assertIn("by_type", report) | ||
| self.assertIn("issues_by_package", report) | ||
|
|
||
| summary = report["summary"] | ||
| self.assertEqual(summary["total_packages"], 2) | ||
| self.assertEqual(summary["packages_with_issues"], 1) | ||
| self.assertEqual(summary["packages_without_issues"], 1) | ||
|
|
||
| def test_check_license_clarity_unclear_license(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="Unknown", | ||
| ) | ||
|
|
||
| issues = nixpkgs.check_license_clarity(package) | ||
|
|
||
| unclear_issues = [i for i in issues if i["type"] == "unclear_license"] | ||
| self.assertTrue(len(unclear_issues) > 0) | ||
| self.assertEqual(unclear_issues[0]["severity"], "warning") | ||
|
|
||
| def test_get_detected_licenses_for_package(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| ) | ||
|
|
||
| # Create resources with licenses | ||
| resource1 = CodebaseResource.objects.create( | ||
| project=self.project, | ||
| path="file1.py", | ||
| detected_license_expression="MIT", | ||
| ) | ||
| resource2 = CodebaseResource.objects.create( | ||
| project=self.project, | ||
| path="file2.py", | ||
| detected_license_expression="Apache-2.0", | ||
| ) | ||
|
|
||
| package.codebase_resources.add(resource1, resource2) | ||
|
|
||
| detected = nixpkgs.get_detected_licenses_for_package(package) | ||
|
|
||
| self.assertEqual(len(detected), 2) | ||
| self.assertIn("MIT", detected) | ||
| self.assertIn("Apache-2.0", detected) | ||
|
|
||
| def test_detect_license_file_issues_multiple_files(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| # Create multiple license files | ||
| license1 = CodebaseResource.objects.create( | ||
| project=self.project, | ||
| path="LICENSE", | ||
| name="LICENSE", | ||
| ) | ||
| license2 = CodebaseResource.objects.create( | ||
| project=self.project, | ||
| path="COPYING", | ||
| name="COPYING", | ||
| ) | ||
|
|
||
| package.codebase_resources.add(license1, license2) | ||
|
|
||
| issues = nixpkgs.detect_license_file_issues(package) | ||
|
|
||
| multiple_file_issues = [ | ||
| i for i in issues if i["type"] == "multiple_license_files" | ||
| ] | ||
| self.assertTrue(len(multiple_file_issues) > 0) | ||
|
|
||
| def test_are_licenses_compatible_exact_match(self): | ||
| from licensedcode.cache import get_licensing | ||
|
|
||
| licensing = get_licensing() | ||
| lic1 = licensing.parse("MIT", validate=True) | ||
| lic2 = licensing.parse("MIT", validate=True) | ||
|
|
||
| result = nixpkgs.are_licenses_compatible(lic1, lic2, licensing) | ||
|
|
||
| self.assertTrue(result) | ||
|
|
||
| def test_are_licenses_compatible_different(self): | ||
| from licensedcode.cache import get_licensing | ||
|
|
||
| licensing = get_licensing() | ||
| lic1 = licensing.parse("MIT", validate=True) | ||
| lic2 = licensing.parse("GPL-3.0-only", validate=True) | ||
|
|
||
| result = nixpkgs.are_licenses_compatible(lic1, lic2, licensing) | ||
|
|
||
| self.assertFalse(result) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test file is placed in the wrong directory. Based on the codebase structure, tests for pipe modules (scanpipe/pipes/*.py) should be located in scanpipe/tests/pipes/ directory, not directly in scanpipe/tests/. This test file should be moved to scanpipe/tests/pipes/test_nixpkgs.py to match the convention used for other pipe tests like test_scancode.py, test_compliance.py, etc.
scanpipe/pipes/nixpkgs.py
Outdated
| if resource.is_legal or 'license' in resource.name.lower(): | ||
| has_license_file = True | ||
| break |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The license file detection logic is potentially unreliable. Checking if 'license' is in resource.name.lower() will match files like 'unlicensed.txt' or 'license_check.py' which are not actual license files. Additionally, using resource.is_legal would be more accurate as it's specifically designed to identify legal files. Consider using resource.is_legal as the primary check, with the name check as a fallback.
| if resource.is_legal or 'license' in resource.name.lower(): | |
| has_license_file = True | |
| break | |
| # Prefer the dedicated legal/notice flag, then fall back to common license filenames. | |
| if resource.is_legal: | |
| has_license_file = True | |
| break | |
| resource_name = resource.name.lower() | |
| if ( | |
| resource_name in { | |
| "license", | |
| "license.txt", | |
| "license.md", | |
| "copying", | |
| "copying.txt", | |
| "copying.md", | |
| } | |
| or resource_name.startswith("license.") | |
| or resource_name.startswith("copying.") | |
| ): | |
| has_license_file = True | |
| break |
| def test_generate_license_report(self): | ||
| # Create package with issue | ||
| DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="package1", | ||
| version="1.0.0", | ||
| ) | ||
|
|
||
| # Create package without issue | ||
| DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="package2", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| report = nixpkgs.generate_license_report(self.project) | ||
|
|
||
| self.assertIn("summary", report) | ||
| self.assertIn("by_severity", report) | ||
| self.assertIn("by_type", report) | ||
| self.assertIn("issues_by_package", report) | ||
|
|
||
| summary = report["summary"] | ||
| self.assertEqual(summary["total_packages"], 2) | ||
| self.assertEqual(summary["packages_with_issues"], 1) | ||
| self.assertEqual(summary["packages_without_issues"], 1) | ||
|
|
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test lacks coverage for error conditions and edge cases. For example, there are no tests for: (1) handling invalid license expressions in detected licenses, (2) behavior when package.codebase_resources is empty, (3) handling of packages with no type field, and (4) performance with large numbers of resources. Consider adding tests for these scenarios to ensure robustness.
| # Add suggestion to notes | ||
| suggestion_note = ( | ||
| f"\nSuggested license: {suggestion['suggested_license']} " | ||
| f"(confidence: {suggestion['confidence']})\n" | ||
| f"Reason: {suggestion['reason']}" | ||
| ) | ||
| package.update(notes=package.notes + suggestion_note) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous issue, the suggestion note is appended without checking if it already exists. Running the pipeline multiple times will create duplicate suggestion entries in the package notes. Consider implementing idempotent note updates.
| # Add suggestion to notes | |
| suggestion_note = ( | |
| f"\nSuggested license: {suggestion['suggested_license']} " | |
| f"(confidence: {suggestion['confidence']})\n" | |
| f"Reason: {suggestion['reason']}" | |
| ) | |
| package.update(notes=package.notes + suggestion_note) | |
| # Add suggestion to notes in an idempotent way | |
| suggestion_note = ( | |
| f"\nSuggested license: {suggestion['suggested_license']} " | |
| f"(confidence: {suggestion['confidence']})\n" | |
| f"Reason: {suggestion['reason']}" | |
| ) | |
| current_notes = package.notes or "" | |
| if suggestion_note not in current_notes: | |
| if current_notes: | |
| new_notes = current_notes + suggestion_note | |
| else: | |
| # Strip leading newline if there were no previous notes | |
| new_notes = suggestion_note.lstrip("\n") | |
| package.update(notes=new_notes) |
| @classmethod | ||
| def steps(cls): | ||
| return ( | ||
| cls.copy_inputs_to_codebase_directory, | ||
| cls.extract_archives, | ||
| cls.collect_and_create_codebase_resources, | ||
| cls.flag_empty_files, | ||
| cls.flag_ignored_resources, | ||
| cls.scan_for_application_packages, | ||
| cls.scan_for_files, | ||
| cls.collect_and_create_license_detections, | ||
| cls.analyze_nixpkgs_license_issues, | ||
| cls.flag_packages_with_license_issues, | ||
| cls.flag_license_detections_needing_review, | ||
| cls.generate_nixpkgs_license_report, | ||
| ) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pipeline references steps that are not defined in the class or its parent: 'copy_inputs_to_codebase_directory', 'collect_and_create_codebase_resources', 'scan_for_application_packages', 'scan_for_files', and 'collect_and_create_license_detections'. While these methods exist in the ScanCodebase pipeline, AnalyzeNixpkgsLicenses inherits from Pipeline (not ScanCodebase), so these methods are not available. Either inherit from ScanCodebase instead of Pipeline, or implement these methods in this class.
scanpipe/tests/test_nixpkgs.py
Outdated
|
|
||
| # Should detect mismatch | ||
| mismatch_issues = [i for i in issues if i["type"] == "license_mismatch"] | ||
| self.assertTrue(len(mismatch_issues) > 0) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertTrue(a > b) cannot provide an informative message. Using assertGreater(a, b) instead will give more informative messages.
| self.assertTrue(len(mismatch_issues) > 0) | |
| self.assertGreater(len(mismatch_issues), 0) |
scanpipe/tests/test_nixpkgs.py
Outdated
| issues = nixpkgs.check_license_clarity(package) | ||
|
|
||
| unclear_issues = [i for i in issues if i["type"] == "unclear_license"] | ||
| self.assertTrue(len(unclear_issues) > 0) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertTrue(a > b) cannot provide an informative message. Using assertGreater(a, b) instead will give more informative messages.
| self.assertTrue(len(unclear_issues) > 0) | |
| self.assertGreater(len(unclear_issues), 0) |
scanpipe/tests/test_nixpkgs.py
Outdated
| multiple_file_issues = [ | ||
| i for i in issues if i["type"] == "multiple_license_files" | ||
| ] | ||
| self.assertTrue(len(multiple_file_issues) > 0) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertTrue(a > b) cannot provide an informative message. Using assertGreater(a, b) instead will give more informative messages.
| self.assertTrue(len(multiple_file_issues) > 0) | |
| self.assertGreater(len(multiple_file_issues), 0) |
- Move test file to correct location (scanpipe/tests/pipes/) - Add package type to ecosystem mapping (pypi->python, npm->nodejs, etc.) - Improve license file detection with precise matching - Fix pipeline inheritance (now extends ScanCodebase) - Add database query optimization with prefetch_related - Use exact matching for package lookup instead of contains - Make notes updates idempotent to prevent duplicates - Use assertGreater instead of assertTrue for better test output - Improve license file detection to avoid false positives Signed-off-by: Sahil-u07 <sahilbagul27@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for expected in expected_licenses: | ||
| if expected.lower() in declared.lower(): | ||
| return None |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the are_licenses_compatible function, this check uses simple substring matching with expected.lower() in declared.lower(). This could result in false matches. For example, if expected is "MIT", it would match "LIMITED" or "SUBMITTED". Consider using more precise matching that respects word boundaries or validates against parsed license expressions.
scanpipe/pipes/nixpkgs.py
Outdated
| proprietary_indicators = ["proprietary", "commercial", "closed"] | ||
| description = (package.description or "").lower() | ||
| notes = (package.notes or "").lower() | ||
|
|
||
| if any(ind in description or ind in notes for ind in proprietary_indicators): |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to other substring matching issues, the proprietary indicators check uses simple substring matching (e.g., "commercial" in description). This could match unintended words like "uncommercial", "noncommercial", or "commercial-free". Consider using more precise matching with word boundaries or regex patterns to avoid false positives.
| # Get license detections that need review from the package's project | ||
| ambiguous_licenses = package.project.discoveredlicenses.filter( | ||
| needs_review=True | ||
| ) | ||
|
|
||
| # Get file regions related to this package | ||
| package_paths = set(package.codebase_resources.values_list('path', flat=True)) | ||
|
|
||
| for lic in ambiguous_licenses: | ||
| # Check if this detection appears in package files | ||
| detection_paths = {fr.get('path') for fr in lic.file_regions} | ||
| if package_paths & detection_paths: |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check_ambiguous_detections function queries all license detections with needs_review=True for the entire project, then filters them by checking if they intersect with package paths. For projects with many packages and many license detections, this could be inefficient as it retrieves all ambiguous licenses even if most don't relate to the current package. Consider optimizing by filtering discoveredlicenses to only those whose file_regions intersect with the package's resources, possibly by using a database query if the file_regions field structure supports it.
scanpipe/pipes/nixpkgs.py
Outdated
| for indicator in unclear_indicators: | ||
| if indicator in declared_lower: |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unclear license detection uses simple substring matching (e.g., "other" in declared_lower). This could lead to false positives. For example, "other" would match "another", "brother", "otherwise", etc. Similarly, "free" would match "freetype", "freestyle", etc. Consider using more precise matching such as word boundary checks or splitting the license expression on known delimiters before checking for these terms.
scanpipe/pipes/nixpkgs.py
Outdated
| copyleft_indicators = ["gpl", "agpl", "lgpl", "mpl", "epl", "cpl"] | ||
| is_copyleft = any(ind in declared_lower for ind in copyleft_indicators) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copyleft detection uses simple substring matching (e.g., "gpl" in declared_lower). This could lead to false positives where "gpl" appears as part of other words. For example, "gpl" could match "single-player", or "cpl" could match "compliance". Consider using more precise matching such as word boundaries or validating against parsed license expressions to avoid false matches.
| current_notes = package.notes or "" | ||
| if suggestion_note not in current_notes: | ||
| package.update(notes=current_notes + suggestion_note) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current_notes variable is retrieved from package.notes on line 167, but this happens after the package notes were already updated on line 143. This means current_notes will contain the newly updated notes (including the "=== License Issues ===" section), not the original notes. The suggestion_note could be appended to the wrong location or create duplicates. Consider refetching the package or using the new_notes variable from line 142 instead.
scanpipe/pipes/nixpkgs.py
Outdated
| base_name = name_lower.split(".", 1)[0] | ||
| if base_name in {"license", "licence", "copying", "copyright"}: |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's inconsistency in how license files are detected between check_license_clarity (lines 274-285) and detect_license_file_issues (lines 546-556). In check_license_clarity, the code checks for exact matches like "license", "license.txt", "license.md" as well as files starting with "license." or "copying.". However, in detect_license_file_issues, it only checks the base name (before the first dot) against a set. This means detect_license_file_issues would not match files like "license.txt" or "license.md" since their base name is just "license", which would match. But it would miss files like "license-mit.txt" which check_license_clarity would catch. Consider using a consistent approach in both functions.
| base_name = name_lower.split(".", 1)[0] | |
| if base_name in {"license", "licence", "copying", "copyright"}: | |
| if ( | |
| # Exact common license filenames | |
| name_lower in { | |
| "license", | |
| "license.txt", | |
| "license.md", | |
| "licence", | |
| "licence.txt", | |
| "licence.md", | |
| "copying", | |
| "copying.txt", | |
| "copying.md", | |
| "copyright", | |
| "copyright.txt", | |
| "copyright.md", | |
| } | |
| # Files starting with common license-related prefixes | |
| or name_lower.startswith( | |
| ( | |
| "license.", | |
| "license-", | |
| "licence.", | |
| "licence-", | |
| "copying.", | |
| "copying-", | |
| "copyright.", | |
| "copyright-", | |
| ) | |
| ) | |
| ): |
scanpipe/pipes/nixpkgs.py
Outdated
|
|
||
| for lic in ambiguous_licenses: | ||
| # Check if this detection appears in package files | ||
| detection_paths = {fr.get('path') for fr in lic.file_regions} |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file_regions field access pattern assumes file_regions is a list of dictionaries with a 'path' key. However, if file_regions has a different structure or if the 'path' key is missing, this will silently return None values in the set. Consider using fr.get('path') with a check for None values, or validate the structure of file_regions before processing to ensure more robust error handling.
| detection_paths = {fr.get('path') for fr in lic.file_regions} | |
| detection_paths = { | |
| fr_path | |
| for fr in (lic.file_regions or []) | |
| if isinstance(fr, dict) | |
| for fr_path in [fr.get("path")] | |
| if fr_path is not None | |
| } |
scanpipe/pipes/nixpkgs.py
Outdated
| # Check if one contains the other (e.g., for compound expressions) | ||
| if detected_str in declared_str or declared_str in detected_str: | ||
| return True |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The license compatibility check uses simple string containment (e.g., "mit" in "limited"). This could lead to false positives where a license identifier is accidentally found as a substring of another word. For example, "mit" could match "submitted", "limited", etc. Consider using more precise matching such as splitting on common delimiters (spaces, dashes, OR, AND) or using word boundaries to avoid partial matches.
|
|
||
| from django.db import models | ||
|
|
||
| from scanpipe.pipelines import Pipeline |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'Pipeline' is not used.
| from scanpipe.pipelines import Pipeline |
- Use word boundary regex matching for all substring checks (fixes false positives like 'mit' matching 'limited') - Consistent license file detection across functions - Better error handling for file_regions structure - Fix current_notes being fetched after package update - Separate merged test methods into distinct tests - Remove unused Pipeline import - Use regex word boundaries for: * License compatibility checks * Unclear license indicators * Ecosystem license patterns * Copyleft detection * Proprietary indicators All substring matching now uses \\b word boundaries to prevent false matches while maintaining functionality. Signed-off-by: Sahil-u07 <sahilbagul27@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 21 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def test_detect_copyleft_with_dependencies(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="GPL-3.0-only", | ||
| ) | ||
|
|
||
| issues = nixpkgs.detect_copyleft_compliance_issues(package) | ||
|
|
||
| # Note: This test would need dependencies to trigger the issue | ||
| # For now, it should return empty list | ||
| self.assertIsInstance(issues, list) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test for detect_copyleft_compliance_issues only checks that it returns a list, but doesn't verify any actual copyleft detection scenarios. The test comment at line 136 acknowledges this ("This test would need dependencies to trigger the issue"). Consider adding test cases that actually create dependencies and verify that copyleft issues are properly detected, or at minimum test the proprietary conflict detection path (lines 560-566 in nixpkgs.py).
| def test_check_license_clarity_unclear_license(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package-unclear", | ||
| version="1.0.0", | ||
| declared_license_expression="Unknown", | ||
| ) | ||
|
|
||
| issues = nixpkgs.check_license_clarity(package) | ||
|
|
||
| unclear_issues = [i for i in issues if i["type"] == "unclear_license"] | ||
| self.assertGreater(len(unclear_issues), 0) | ||
| self.assertEqual(unclear_issues[0]["severity"], "warning") |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for the missing_license_file scenario in check_license_clarity (lines 306-312 in nixpkgs.py). The test suite should verify that when a package has a declared_license_expression but no license file, the appropriate info-level issue is generated.
scanpipe/pipes/nixpkgs.py
Outdated
| declared_lower = package.declared_license_expression.lower() | ||
|
|
||
| # Check for copyleft licenses using word boundary matching | ||
| import re |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import statement is placed inside the function instead of at the module level. This violates Python best practices and PEP 8 guidelines. The 're' module should be imported at the top of the file.
scanpipe/pipes/nixpkgs.py
Outdated
| # Split on common license expression delimiters for more precise matching | ||
| import re | ||
| declared_parts = set(re.split(r'[\s\-_()]+|\bor\b|\band\b|\bwith\b', declared_str)) | ||
| detected_parts = set(re.split(r'[\s\-_()]+|\bor\b|\band\b|\bwith\b', detected_str)) | ||
|
|
||
| # Remove empty strings from split | ||
| declared_parts.discard('') | ||
| detected_parts.discard('') | ||
|
|
||
| # Check if detected parts are subset of declared or vice versa | ||
| if detected_parts and declared_parts: | ||
| if detected_parts.issubset(declared_parts) or declared_parts.issubset(detected_parts): | ||
| return True | ||
|
|
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This license compatibility check has a significant logical flaw. The regex pattern splits on word boundaries including common license operators like 'or', 'and', and 'with', which breaks the semantic meaning of license expressions. For example, "GPL-3.0-or-later" would be split into parts including "or", treating the operator as a separate component. This could lead to false positives where incompatible licenses are considered compatible because they share common words like "or" or "and".
| detection_paths = { | ||
| fr_path | ||
| for fr in (lic.file_regions or []) | ||
| if isinstance(fr, dict) | ||
| for fr_path in [fr.get("path")] | ||
| if fr_path is not None | ||
| } |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function uses a set comprehension with nested list comprehensions that could raise exceptions if the data structure doesn't match expectations. Specifically, if 'file_regions' is None or if individual items in it aren't dictionaries, the code will fail. Add defensive checks to handle cases where file_regions might be None or contain non-dict items.
| def test_detect_license_mismatch(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| # Create a resource with different detected license | ||
| resource = CodebaseResource.objects.create( | ||
| project=self.project, | ||
| path="test.py", | ||
| detected_license_expression="GPL-3.0-only", | ||
| ) | ||
| package.codebase_resources.add(resource) | ||
|
|
||
| issues = nixpkgs.detect_package_license_issues(package) | ||
|
|
||
| # Should detect mismatch | ||
| mismatch_issues = [i for i in issues if i["type"] == "license_mismatch"] | ||
| self.assertGreater(len(mismatch_issues), 0) | ||
| self.assertEqual(mismatch_issues[0]["severity"], "warning") |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for the find_license_inconsistencies function when the declared license is invalid (lines 158-168 in nixpkgs.py). The test suite should include a case where the declared_license expression fails to parse, to verify that appropriate error issues are returned with the correct severity and message.
| def test_detect_license_file_issues_multiple_files(self): | ||
| package = DiscoveredPackage.objects.create( | ||
| project=self.project, | ||
| type="pypi", | ||
| name="test-package", | ||
| version="1.0.0", | ||
| declared_license_expression="MIT", | ||
| ) | ||
|
|
||
| # Create multiple license files | ||
| license1 = CodebaseResource.objects.create( | ||
| project=self.project, | ||
| path="LICENSE", | ||
| name="LICENSE", | ||
| ) | ||
| license2 = CodebaseResource.objects.create( | ||
| project=self.project, | ||
| path="COPYING", | ||
| name="COPYING", | ||
| ) | ||
|
|
||
| package.codebase_resources.add(license1, license2) | ||
|
|
||
| issues = nixpkgs.detect_license_file_issues(package) | ||
|
|
||
| multiple_file_issues = [ | ||
| i for i in issues if i["type"] == "multiple_license_files" | ||
| ] | ||
| self.assertGreater(len(multiple_file_issues), 0) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for the license_file_without_declaration scenario in detect_license_file_issues (lines 617-625 in nixpkgs.py). The test suite should verify that when license files exist but no declared_license_expression is present, the appropriate warning-level issue is generated.
scanpipe/pipes/nixpkgs.py
Outdated
|
|
||
| # Check if license is too generic or unclear | ||
| if package.declared_license_expression: | ||
| import re |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import statement is placed inside the function instead of at the module level. This violates Python best practices and PEP 8 guidelines. The 're' module should be imported at the top of the file.
| package.update(notes=new_notes) | ||
|
|
||
| # Get detected licenses for this package | ||
| detected_licenses = nixpkgs.get_detected_licenses_for_package(package) | ||
|
|
||
| # Try to suggest correction if declared license is wrong | ||
| if detected_licenses: | ||
| suggestion = nixpkgs.suggest_license_correction( | ||
| package, | ||
| detected_licenses | ||
| ) | ||
| if suggestion and package.declared_license_expression: | ||
| if suggestion["suggested_license"] != package.declared_license_expression: | ||
| self.log( | ||
| f"Package {package}: suggested license " | ||
| f"'{suggestion['suggested_license']}' " | ||
| f"(confidence: {suggestion['confidence']})" | ||
| ) | ||
| # Add suggestion to notes (idempotent) - use the already updated notes | ||
| suggestion_note = ( | ||
| f"\nSuggested license: {suggestion['suggested_license']} " | ||
| f"(confidence: {suggestion['confidence']})\n" | ||
| f"Reason: {suggestion['reason']}" | ||
| ) | ||
| # Refresh package from DB to get latest notes after previous update | ||
| package.refresh_from_db() | ||
| current_notes = package.notes or "" | ||
| if suggestion_note not in current_notes: | ||
| package.update(notes=current_notes + suggestion_note) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation has a potential race condition or stale data issue. After updating the package notes at line 142, the code then calls refresh_from_db() at line 167, but only if a suggestion exists. This means the second update at line 170 might be operating on stale data if the first update completed. The logic should either refresh before the second update or combine both updates into a single operation to avoid potential data inconsistency.
|
|
||
| sections = [] | ||
| if current_notes: | ||
| sections.append(current_notes) | ||
| sections.append("\n" + header) | ||
| sections.extend(issue_messages) | ||
|
|
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code constructs multi-line strings by joining sections with newlines, but the logic is complex and hard to follow. At line 136, sections might be empty (if current_notes is empty), then line 138 adds "\n" + header (which starts with a newline), and lines 139 extends with issue_messages. This could result in unexpected leading newlines. Consider simplifying this logic or adding comments to clarify the intended format of the final notes string.
| sections = [] | |
| if current_notes: | |
| sections.append(current_notes) | |
| sections.append("\n" + header) | |
| sections.extend(issue_messages) | |
| sections = [] | |
| if current_notes: | |
| # Preserve existing notes, followed by a blank line before the header | |
| sections.append(current_notes) | |
| sections.append("") | |
| sections.append(header) | |
| sections.extend(issue_messages) |
ebe82ba to
79b3b19
Compare
Critical fixes: - Move 're' module import to top level (PEP 8 compliance) - Fix license compatibility check to not split on operators (or/and/with) This was breaking semantic meaning of expressions like 'GPL-3.0-or-later' - Add logging for invalid license expressions instead of silent failures - Improve license file detection with extension validation - Combine license file detection into single loop (performance) Moderate fixes: - Better error handling with debug logging - More precise license file pattern matching - Performance improvement by reducing redundant iterations Signed-off-by: Sahil Lenka <sahillenka44@gmail.com>
79b3b19 to
77a62da
Compare
Implement automated license issue detection and reporting for Nixpkgs packages.
Features:
Implementation:
Detection capabilities:
Output stored in project.extra_data with severity levels (error/warning/info) and automated suggestions for corrections.
Addresses parent issue #1939