Skip to content

Conversation

@Sahil-u07
Copy link

Implement automated license issue detection and reporting for Nixpkgs packages.

Features:

  • Automatically detect incorrect license declarations
  • Determine correct licenses with confidence scores
  • Apply nixpkgs-specific rules for diverse tech stacks
  • Flag packages and detections needing review
  • Generate comprehensive license reports

Implementation:

  • scanpipe/pipes/nixpkgs.py: Core license analysis utilities (13 functions)
  • scanpipe/pipelines/analyze_nixpkgs_licenses.py: Automated pipeline
  • scanpipe/tests/test_nixpkgs.py: Comprehensive test suite (16 tests)

Detection capabilities:

  • Missing/invalid license declarations
  • License mismatches (declared vs detected)
  • Ecosystem-specific license validation (Python, Rust, Node.js, Go, etc.)
  • Copyleft compliance issues
  • License file inconsistencies
  • Ambiguous detections requiring review

Output stored in project.extra_data with severity levels (error/warning/info) and automated suggestions for corrections.

Addresses parent issue #1939

Copilot AI review requested due to automatic review settings January 18, 2026 16:16
Implement automated license issue detection and reporting for Nixpkgs packages.

Features:
- Automatically detect incorrect license declarations
- Determine correct licenses with confidence scores
- Apply nixpkgs-specific rules for diverse tech stacks
- Flag packages and detections needing review
- Generate comprehensive license reports

Implementation:
- scanpipe/pipes/nixpkgs.py: Core license analysis utilities (13 functions)
- scanpipe/pipelines/analyze_nixpkgs_licenses.py: Automated pipeline
- scanpipe/tests/test_nixpkgs.py: Comprehensive test suite (16 tests)

Detection capabilities:
- Missing/invalid license declarations
- License mismatches (declared vs detected)
- Ecosystem-specific license validation (Python, Rust, Node.js, Go, etc.)
- Copyleft compliance issues
- License file inconsistencies
- Ambiguous detections requiring review

Output stored in project.extra_data with severity levels (error/warning/info)
and automated suggestions for corrections.

Addresses parent issue aboutcode-org#1939

Signed-off-by: Sahil Lenka <sahillenka44@gmail.com>
@Sahil-u07 Sahil-u07 force-pushed the nixpkgs-clarity-license-detection branch from ec2a7fd to 23f3a90 Compare January 18, 2026 16:17
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements automated license issue detection and reporting for Nixpkgs packages. The implementation adds a new pipeline (analyze_nixpkgs_licenses.py), core utility functions (pipes/nixpkgs.py), and a comprehensive test suite to detect incorrect license declarations, determine correct licenses with confidence scores, and apply nixpkgs-specific rules for diverse tech stacks.

Changes:

  • New pipeline for automated Nixpkgs license analysis with 6 custom steps
  • 13 utility functions in nixpkgs.py for license detection, validation, and reporting
  • 16 test cases covering the core functionality

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 19 comments.

File Description
scanpipe/tests/test_nixpkgs.py Comprehensive test suite for nixpkgs license analysis utilities with 16 test cases
scanpipe/pipes/nixpkgs.py Core license analysis utilities including detection, validation, ecosystem checking, and reporting functions
scanpipe/pipelines/analyze_nixpkgs_licenses.py Automated pipeline implementing license scanning, issue detection, flagging, and report generation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 522 to 523
if any(lic in name_lower for lic in ["license", "licence", "copying", "copyright"]):
license_files.append(resource)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'name_lower' variable is assigned but the check uses multiple substrings ('license', 'licence', 'copying', 'copyright'). This will match many false positives like 'copyright_notice.txt' which may not be a license file. The logic on line 522 appears to use the same broad matching. Consider more precise matching patterns or using the resource.is_legal attribute.

Suggested change
if any(lic in name_lower for lic in ["license", "licence", "copying", "copyright"]):
license_files.append(resource)
# Prefer using an explicit legal-file flag when available to avoid false positives.
is_legal = getattr(resource, "is_legal", False)
if is_legal:
license_files.append(resource)
continue
# Fall back to matching common canonical license filenames by base name only.
_base_name = name_lower.split(".", 1)[0]
if _base_name in {"license", "licence", "copying", "copyright"}:
license_files.append(resource)

Copilot uses AI. Check for mistakes.
Comment on lines 410 to 419
NIXPKGS_ECOSYSTEM_LICENSE_PATTERNS = {
"python": ["MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-or-later"],
"rust": ["MIT", "Apache-2.0", "MIT OR Apache-2.0"],
"nodejs": ["MIT", "ISC", "BSD-3-Clause"],
"go": ["MIT", "Apache-2.0", "BSD-3-Clause"],
"haskell": ["BSD-3-Clause", "MIT"],
"ruby": ["MIT", "GPL-2.0-or-later"],
"perl": ["Artistic-2.0", "GPL-1.0-or-later"],
"java": ["Apache-2.0", "MIT", "LGPL-2.1-or-later"],
}
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NIXPKGS_ECOSYSTEM_LICENSE_PATTERNS dictionary maps package types to expected licenses, but the package.type field may not always match these keys exactly. For example, a package type might be 'pypi' (as used in tests) but the dictionary key is 'python'. This mismatch will cause check_nixpkgs_ecosystem_license to always return None for PyPI packages. Either update the dictionary keys to match actual package types or implement a mapping between package types and ecosystem names.

Copilot uses AI. Check for mistakes.
Comment on lines 1 to 332
# SPDX-License-Identifier: Apache-2.0
#
# http://nexb.com and https://github.com/aboutcode-org/scancode.io
# The ScanCode.io software is licensed under the Apache License version 2.0.
# Data generated with ScanCode.io is provided as-is without warranties.
# ScanCode is a trademark of nexB Inc.
#
# You may not use this software except in compliance with the License.
# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed
# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License.
#
# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
# ScanCode.io should be considered or used as legal advice. Consult an Attorney
# for any legal advice.
#
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/aboutcode-org/scancode.io for support and download.

from django.test import TestCase

from scanpipe.models import CodebaseResource
from scanpipe.models import DiscoveredPackage
from scanpipe.models import Project
from scanpipe.pipes import nixpkgs


class NixpkgsLicenseAnalysisTest(TestCase):
def setUp(self):
self.project = Project.objects.create(name="Test Nixpkgs Project")

def test_detect_missing_declared_license(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
)

issues = nixpkgs.detect_package_license_issues(package)

self.assertEqual(len(issues), 1)
self.assertEqual(issues[0]["type"], "missing_declared_license")
self.assertEqual(issues[0]["severity"], "error")

def test_detect_license_mismatch(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="MIT",
)

# Create a resource with different detected license
resource = CodebaseResource.objects.create(
project=self.project,
path="test.py",
detected_license_expression="GPL-3.0-only",
)
package.codebase_resources.add(resource)

issues = nixpkgs.detect_package_license_issues(package)

# Should detect mismatch
mismatch_issues = [i for i in issues if i["type"] == "license_mismatch"]
self.assertTrue(len(mismatch_issues) > 0)
self.assertEqual(mismatch_issues[0]["severity"], "warning")

def test_normalize_nixpkgs_license(self):
# Test common nixpkgs license mappings
self.assertEqual(
nixpkgs.normalize_nixpkgs_license("gpl2+"),
"GPL-2.0-or-later"
)
self.assertEqual(
nixpkgs.normalize_nixpkgs_license("apache2"),
"Apache-2.0"
)
self.assertEqual(
nixpkgs.normalize_nixpkgs_license("mit"),
"MIT"
)

# Test unknown license passes through
self.assertEqual(
nixpkgs.normalize_nixpkgs_license("CustomLicense"),
"CustomLicense"
)

def test_check_nixpkgs_ecosystem_license(self):
# Python package with unusual license
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="Artistic-2.0",
)

issue = nixpkgs.check_nixpkgs_ecosystem_license(package)

self.assertIsNotNone(issue)
self.assertEqual(issue["type"], "unusual_ecosystem_license")
self.assertEqual(issue["severity"], "info")

def test_check_nixpkgs_ecosystem_license_typical(self):
# Python package with typical license
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="MIT",
)

issue = nixpkgs.check_nixpkgs_ecosystem_license(package)

# No issue expected for typical license
self.assertIsNone(issue)

def test_detect_copyleft_with_dependencies(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="GPL-3.0-only",
)

issues = nixpkgs.detect_copyleft_compliance_issues(package)

# Note: This test would need dependencies to trigger the issue
# For now, it should return empty list
self.assertIsInstance(issues, list)

def test_suggest_license_correction_single_detected(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="GPL-3.0-only",
)

detected_licenses = ["MIT"]
suggestion = nixpkgs.suggest_license_correction(package, detected_licenses)

self.assertIsNotNone(suggestion)
self.assertEqual(suggestion["suggested_license"], "MIT")
self.assertEqual(suggestion["confidence"], "high")

def test_suggest_license_correction_multiple_detected(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="GPL-3.0-only",
)

detected_licenses = ["MIT", "Apache-2.0"]
suggestion = nixpkgs.suggest_license_correction(package, detected_licenses)

self.assertIsNotNone(suggestion)
self.assertIn("OR", suggestion["suggested_license"])
self.assertEqual(suggestion["confidence"], "medium")

def test_suggest_license_correction_no_detected(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="MIT",
)

suggestion = nixpkgs.suggest_license_correction(package, [])

self.assertIsNone(suggestion)

def test_analyze_license_issues(self):
# Create package with issue
DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="package-with-issue",
version="1.0.0",
# No declared license - should trigger issue
)

# Create package without issue
DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="package-without-issue",
version="1.0.0",
declared_license_expression="MIT",
)

issues = nixpkgs.analyze_license_issues(self.project)

# Should have issues for one package
self.assertEqual(len(issues), 1)

def test_generate_license_report(self):
# Create package with issue
DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="package1",
version="1.0.0",
)

# Create package without issue
DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="package2",
version="1.0.0",
declared_license_expression="MIT",
)

report = nixpkgs.generate_license_report(self.project)

self.assertIn("summary", report)
self.assertIn("by_severity", report)
self.assertIn("by_type", report)
self.assertIn("issues_by_package", report)

summary = report["summary"]
self.assertEqual(summary["total_packages"], 2)
self.assertEqual(summary["packages_with_issues"], 1)
self.assertEqual(summary["packages_without_issues"], 1)

def test_check_license_clarity_unclear_license(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="Unknown",
)

issues = nixpkgs.check_license_clarity(package)

unclear_issues = [i for i in issues if i["type"] == "unclear_license"]
self.assertTrue(len(unclear_issues) > 0)
self.assertEqual(unclear_issues[0]["severity"], "warning")

def test_get_detected_licenses_for_package(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
)

# Create resources with licenses
resource1 = CodebaseResource.objects.create(
project=self.project,
path="file1.py",
detected_license_expression="MIT",
)
resource2 = CodebaseResource.objects.create(
project=self.project,
path="file2.py",
detected_license_expression="Apache-2.0",
)

package.codebase_resources.add(resource1, resource2)

detected = nixpkgs.get_detected_licenses_for_package(package)

self.assertEqual(len(detected), 2)
self.assertIn("MIT", detected)
self.assertIn("Apache-2.0", detected)

def test_detect_license_file_issues_multiple_files(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="MIT",
)

# Create multiple license files
license1 = CodebaseResource.objects.create(
project=self.project,
path="LICENSE",
name="LICENSE",
)
license2 = CodebaseResource.objects.create(
project=self.project,
path="COPYING",
name="COPYING",
)

package.codebase_resources.add(license1, license2)

issues = nixpkgs.detect_license_file_issues(package)

multiple_file_issues = [
i for i in issues if i["type"] == "multiple_license_files"
]
self.assertTrue(len(multiple_file_issues) > 0)

def test_are_licenses_compatible_exact_match(self):
from licensedcode.cache import get_licensing

licensing = get_licensing()
lic1 = licensing.parse("MIT", validate=True)
lic2 = licensing.parse("MIT", validate=True)

result = nixpkgs.are_licenses_compatible(lic1, lic2, licensing)

self.assertTrue(result)

def test_are_licenses_compatible_different(self):
from licensedcode.cache import get_licensing

licensing = get_licensing()
lic1 = licensing.parse("MIT", validate=True)
lic2 = licensing.parse("GPL-3.0-only", validate=True)

result = nixpkgs.are_licenses_compatible(lic1, lic2, licensing)

self.assertFalse(result)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test file is placed in the wrong directory. Based on the codebase structure, tests for pipe modules (scanpipe/pipes/*.py) should be located in scanpipe/tests/pipes/ directory, not directly in scanpipe/tests/. This test file should be moved to scanpipe/tests/pipes/test_nixpkgs.py to match the convention used for other pipe tests like test_scancode.py, test_compliance.py, etc.

Copilot uses AI. Check for mistakes.
Comment on lines 272 to 274
if resource.is_legal or 'license' in resource.name.lower():
has_license_file = True
break
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The license file detection logic is potentially unreliable. Checking if 'license' is in resource.name.lower() will match files like 'unlicensed.txt' or 'license_check.py' which are not actual license files. Additionally, using resource.is_legal would be more accurate as it's specifically designed to identify legal files. Consider using resource.is_legal as the primary check, with the name check as a fallback.

Suggested change
if resource.is_legal or 'license' in resource.name.lower():
has_license_file = True
break
# Prefer the dedicated legal/notice flag, then fall back to common license filenames.
if resource.is_legal:
has_license_file = True
break
resource_name = resource.name.lower()
if (
resource_name in {
"license",
"license.txt",
"license.md",
"copying",
"copying.txt",
"copying.md",
}
or resource_name.startswith("license.")
or resource_name.startswith("copying.")
):
has_license_file = True
break

Copilot uses AI. Check for mistakes.
Comment on lines +209 to +238
def test_generate_license_report(self):
# Create package with issue
DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="package1",
version="1.0.0",
)

# Create package without issue
DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="package2",
version="1.0.0",
declared_license_expression="MIT",
)

report = nixpkgs.generate_license_report(self.project)

self.assertIn("summary", report)
self.assertIn("by_severity", report)
self.assertIn("by_type", report)
self.assertIn("issues_by_package", report)

summary = report["summary"]
self.assertEqual(summary["total_packages"], 2)
self.assertEqual(summary["packages_with_issues"], 1)
self.assertEqual(summary["packages_without_issues"], 1)

Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test lacks coverage for error conditions and edge cases. For example, there are no tests for: (1) handling invalid license expressions in detected licenses, (2) behavior when package.codebase_resources is empty, (3) handling of packages with no type field, and (4) performance with large numbers of resources. Consider adding tests for these scenarios to ensure robustness.

Copilot uses AI. Check for mistakes.
Comment on lines 152 to 158
# Add suggestion to notes
suggestion_note = (
f"\nSuggested license: {suggestion['suggested_license']} "
f"(confidence: {suggestion['confidence']})\n"
f"Reason: {suggestion['reason']}"
)
package.update(notes=package.notes + suggestion_note)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the previous issue, the suggestion note is appended without checking if it already exists. Running the pipeline multiple times will create duplicate suggestion entries in the package notes. Consider implementing idempotent note updates.

Suggested change
# Add suggestion to notes
suggestion_note = (
f"\nSuggested license: {suggestion['suggested_license']} "
f"(confidence: {suggestion['confidence']})\n"
f"Reason: {suggestion['reason']}"
)
package.update(notes=package.notes + suggestion_note)
# Add suggestion to notes in an idempotent way
suggestion_note = (
f"\nSuggested license: {suggestion['suggested_license']} "
f"(confidence: {suggestion['confidence']})\n"
f"Reason: {suggestion['reason']}"
)
current_notes = package.notes or ""
if suggestion_note not in current_notes:
if current_notes:
new_notes = current_notes + suggestion_note
else:
# Strip leading newline if there were no previous notes
new_notes = suggestion_note.lstrip("\n")
package.update(notes=new_notes)

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +85
@classmethod
def steps(cls):
return (
cls.copy_inputs_to_codebase_directory,
cls.extract_archives,
cls.collect_and_create_codebase_resources,
cls.flag_empty_files,
cls.flag_ignored_resources,
cls.scan_for_application_packages,
cls.scan_for_files,
cls.collect_and_create_license_detections,
cls.analyze_nixpkgs_license_issues,
cls.flag_packages_with_license_issues,
cls.flag_license_detections_needing_review,
cls.generate_nixpkgs_license_report,
)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pipeline references steps that are not defined in the class or its parent: 'copy_inputs_to_codebase_directory', 'collect_and_create_codebase_resources', 'scan_for_application_packages', 'scan_for_files', and 'collect_and_create_license_detections'. While these methods exist in the ScanCodebase pipeline, AnalyzeNixpkgsLicenses inherits from Pipeline (not ScanCodebase), so these methods are not available. Either inherit from ScanCodebase instead of Pipeline, or implement these methods in this class.

Copilot uses AI. Check for mistakes.

# Should detect mismatch
mismatch_issues = [i for i in issues if i["type"] == "license_mismatch"]
self.assertTrue(len(mismatch_issues) > 0)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertTrue(a > b) cannot provide an informative message. Using assertGreater(a, b) instead will give more informative messages.

Suggested change
self.assertTrue(len(mismatch_issues) > 0)
self.assertGreater(len(mismatch_issues), 0)

Copilot uses AI. Check for mistakes.
issues = nixpkgs.check_license_clarity(package)

unclear_issues = [i for i in issues if i["type"] == "unclear_license"]
self.assertTrue(len(unclear_issues) > 0)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertTrue(a > b) cannot provide an informative message. Using assertGreater(a, b) instead will give more informative messages.

Suggested change
self.assertTrue(len(unclear_issues) > 0)
self.assertGreater(len(unclear_issues), 0)

Copilot uses AI. Check for mistakes.
multiple_file_issues = [
i for i in issues if i["type"] == "multiple_license_files"
]
self.assertTrue(len(multiple_file_issues) > 0)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertTrue(a > b) cannot provide an informative message. Using assertGreater(a, b) instead will give more informative messages.

Suggested change
self.assertTrue(len(multiple_file_issues) > 0)
self.assertGreater(len(multiple_file_issues), 0)

Copilot uses AI. Check for mistakes.
- Move test file to correct location (scanpipe/tests/pipes/)
- Add package type to ecosystem mapping (pypi->python, npm->nodejs, etc.)
- Improve license file detection with precise matching
- Fix pipeline inheritance (now extends ScanCodebase)
- Add database query optimization with prefetch_related
- Use exact matching for package lookup instead of contains
- Make notes updates idempotent to prevent duplicates
- Use assertGreater instead of assertTrue for better test output
- Improve license file detection to avoid false positives

Signed-off-by: Sahil-u07 <sahilbagul27@gmail.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 477 to 479
for expected in expected_licenses:
if expected.lower() in declared.lower():
return None
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the are_licenses_compatible function, this check uses simple substring matching with expected.lower() in declared.lower(). This could result in false matches. For example, if expected is "MIT", it would match "LIMITED" or "SUBMITTED". Consider using more precise matching that respects word boundaries or validates against parsed license expressions.

Copilot uses AI. Check for mistakes.
Comment on lines 521 to 525
proprietary_indicators = ["proprietary", "commercial", "closed"]
description = (package.description or "").lower()
notes = (package.notes or "").lower()

if any(ind in description or ind in notes for ind in proprietary_indicators):
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to other substring matching issues, the proprietary indicators check uses simple substring matching (e.g., "commercial" in description). This could match unintended words like "uncommercial", "noncommercial", or "commercial-free". Consider using more precise matching with word boundaries or regex patterns to avoid false positives.

Copilot uses AI. Check for mistakes.
Comment on lines 218 to 229
# Get license detections that need review from the package's project
ambiguous_licenses = package.project.discoveredlicenses.filter(
needs_review=True
)

# Get file regions related to this package
package_paths = set(package.codebase_resources.values_list('path', flat=True))

for lic in ambiguous_licenses:
# Check if this detection appears in package files
detection_paths = {fr.get('path') for fr in lic.file_regions}
if package_paths & detection_paths:
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check_ambiguous_detections function queries all license detections with needs_review=True for the entire project, then filters them by checking if they intersect with package paths. For projects with many packages and many license detections, this could be inefficient as it retrieves all ambiguous licenses even if most don't relate to the current package. Consider optimizing by filtering discoveredlicenses to only those whose file_regions intersect with the package's resources, possibly by using a database query if the file_regions field structure supports it.

Copilot uses AI. Check for mistakes.
Comment on lines 261 to 262
for indicator in unclear_indicators:
if indicator in declared_lower:
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unclear license detection uses simple substring matching (e.g., "other" in declared_lower). This could lead to false positives. For example, "other" would match "another", "brother", "otherwise", etc. Similarly, "free" would match "freetype", "freestyle", etc. Consider using more precise matching such as word boundary checks or splitting the license expression on known delimiters before checking for these terms.

Copilot uses AI. Check for mistakes.
Comment on lines 503 to 504
copyleft_indicators = ["gpl", "agpl", "lgpl", "mpl", "epl", "cpl"]
is_copyleft = any(ind in declared_lower for ind in copyleft_indicators)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyleft detection uses simple substring matching (e.g., "gpl" in declared_lower). This could lead to false positives where "gpl" appears as part of other words. For example, "gpl" could match "single-player", or "cpl" could match "compliance". Consider using more precise matching such as word boundaries or validating against parsed license expressions to avoid false matches.

Copilot uses AI. Check for mistakes.
Comment on lines +167 to +169
current_notes = package.notes or ""
if suggestion_note not in current_notes:
package.update(notes=current_notes + suggestion_note)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current_notes variable is retrieved from package.notes on line 167, but this happens after the package notes were already updated on line 143. This means current_notes will contain the newly updated notes (including the "=== License Issues ===" section), not the original notes. The suggestion_note could be appended to the wrong location or create duplicates. Consider refetching the package or using the new_notes variable from line 142 instead.

Copilot uses AI. Check for mistakes.
Comment on lines 554 to 555
base_name = name_lower.split(".", 1)[0]
if base_name in {"license", "licence", "copying", "copyright"}:
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's inconsistency in how license files are detected between check_license_clarity (lines 274-285) and detect_license_file_issues (lines 546-556). In check_license_clarity, the code checks for exact matches like "license", "license.txt", "license.md" as well as files starting with "license." or "copying.". However, in detect_license_file_issues, it only checks the base name (before the first dot) against a set. This means detect_license_file_issues would not match files like "license.txt" or "license.md" since their base name is just "license", which would match. But it would miss files like "license-mit.txt" which check_license_clarity would catch. Consider using a consistent approach in both functions.

Suggested change
base_name = name_lower.split(".", 1)[0]
if base_name in {"license", "licence", "copying", "copyright"}:
if (
# Exact common license filenames
name_lower in {
"license",
"license.txt",
"license.md",
"licence",
"licence.txt",
"licence.md",
"copying",
"copying.txt",
"copying.md",
"copyright",
"copyright.txt",
"copyright.md",
}
# Files starting with common license-related prefixes
or name_lower.startswith(
(
"license.",
"license-",
"licence.",
"licence-",
"copying.",
"copying-",
"copyright.",
"copyright-",
)
)
):

Copilot uses AI. Check for mistakes.

for lic in ambiguous_licenses:
# Check if this detection appears in package files
detection_paths = {fr.get('path') for fr in lic.file_regions}
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file_regions field access pattern assumes file_regions is a list of dictionaries with a 'path' key. However, if file_regions has a different structure or if the 'path' key is missing, this will silently return None values in the set. Consider using fr.get('path') with a check for None values, or validate the structure of file_regions before processing to ensure more robust error handling.

Suggested change
detection_paths = {fr.get('path') for fr in lic.file_regions}
detection_paths = {
fr_path
for fr in (lic.file_regions or [])
if isinstance(fr, dict)
for fr_path in [fr.get("path")]
if fr_path is not None
}

Copilot uses AI. Check for mistakes.
Comment on lines 204 to 206
# Check if one contains the other (e.g., for compound expressions)
if detected_str in declared_str or declared_str in detected_str:
return True
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The license compatibility check uses simple string containment (e.g., "mit" in "limited"). This could lead to false positives where a license identifier is accidentally found as a substring of another word. For example, "mit" could match "submitted", "limited", etc. Consider using more precise matching such as splitting on common delimiters (spaces, dashes, OR, AND) or using word boundaries to avoid partial matches.

Copilot uses AI. Check for mistakes.

from django.db import models

from scanpipe.pipelines import Pipeline
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'Pipeline' is not used.

Suggested change
from scanpipe.pipelines import Pipeline

Copilot uses AI. Check for mistakes.
- Use word boundary regex matching for all substring checks
  (fixes false positives like 'mit' matching 'limited')
- Consistent license file detection across functions
- Better error handling for file_regions structure
- Fix current_notes being fetched after package update
- Separate merged test methods into distinct tests
- Remove unused Pipeline import
- Use regex word boundaries for:
  * License compatibility checks
  * Unclear license indicators
  * Ecosystem license patterns
  * Copyleft detection
  * Proprietary indicators

All substring matching now uses \\b word boundaries to prevent
false matches while maintaining functionality.

Signed-off-by: Sahil-u07 <sahilbagul27@gmail.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 21 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +125 to +138
def test_detect_copyleft_with_dependencies(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="GPL-3.0-only",
)

issues = nixpkgs.detect_copyleft_compliance_issues(package)

# Note: This test would need dependencies to trigger the issue
# For now, it should return empty list
self.assertIsInstance(issues, list)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test for detect_copyleft_compliance_issues only checks that it returns a list, but doesn't verify any actual copyleft detection scenarios. The test comment at line 136 acknowledges this ("This test would need dependencies to trigger the issue"). Consider adding test cases that actually create dependencies and verify that copyleft issues are properly detected, or at minimum test the proprietary conflict detection path (lines 560-566 in nixpkgs.py).

Copilot uses AI. Check for mistakes.
Comment on lines +239 to +252
def test_check_license_clarity_unclear_license(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package-unclear",
version="1.0.0",
declared_license_expression="Unknown",
)

issues = nixpkgs.check_license_clarity(package)

unclear_issues = [i for i in issues if i["type"] == "unclear_license"]
self.assertGreater(len(unclear_issues), 0)
self.assertEqual(unclear_issues[0]["severity"], "warning")
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for the missing_license_file scenario in check_license_clarity (lines 306-312 in nixpkgs.py). The test suite should verify that when a package has a declared_license_expression but no license file, the appropriate info-level issue is generated.

Copilot uses AI. Check for mistakes.
declared_lower = package.declared_license_expression.lower()

# Check for copyleft licenses using word boundary matching
import re
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import statement is placed inside the function instead of at the module level. This violates Python best practices and PEP 8 guidelines. The 're' module should be imported at the top of the file.

Copilot uses AI. Check for mistakes.
Comment on lines 204 to 217
# Split on common license expression delimiters for more precise matching
import re
declared_parts = set(re.split(r'[\s\-_()]+|\bor\b|\band\b|\bwith\b', declared_str))
detected_parts = set(re.split(r'[\s\-_()]+|\bor\b|\band\b|\bwith\b', detected_str))

# Remove empty strings from split
declared_parts.discard('')
detected_parts.discard('')

# Check if detected parts are subset of declared or vice versa
if detected_parts and declared_parts:
if detected_parts.issubset(declared_parts) or declared_parts.issubset(detected_parts):
return True

Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This license compatibility check has a significant logical flaw. The regex pattern splits on word boundaries including common license operators like 'or', 'and', and 'with', which breaks the semantic meaning of license expressions. For example, "GPL-3.0-or-later" would be split into parts including "or", treating the operator as a separate component. This could lead to false positives where incompatible licenses are considered compatible because they share common words like "or" or "and".

Copilot uses AI. Check for mistakes.
Comment on lines +238 to +244
detection_paths = {
fr_path
for fr in (lic.file_regions or [])
if isinstance(fr, dict)
for fr_path in [fr.get("path")]
if fr_path is not None
}
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function uses a set comprehension with nested list comprehensions that could raise exceptions if the data structure doesn't match expectations. Specifically, if 'file_regions' is None or if individual items in it aren't dictionaries, the code will fail. Add defensive checks to handle cases where file_regions might be None or contain non-dict items.

Copilot uses AI. Check for mistakes.
Comment on lines +49 to +71
def test_detect_license_mismatch(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="MIT",
)

# Create a resource with different detected license
resource = CodebaseResource.objects.create(
project=self.project,
path="test.py",
detected_license_expression="GPL-3.0-only",
)
package.codebase_resources.add(resource)

issues = nixpkgs.detect_package_license_issues(package)

# Should detect mismatch
mismatch_issues = [i for i in issues if i["type"] == "license_mismatch"]
self.assertGreater(len(mismatch_issues), 0)
self.assertEqual(mismatch_issues[0]["severity"], "warning")
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for the find_license_inconsistencies function when the declared license is invalid (lines 158-168 in nixpkgs.py). The test suite should include a case where the declared_license expression fails to parse, to verify that appropriate error issues are returned with the correct severity and message.

Copilot uses AI. Check for mistakes.
Comment on lines +282 to +310
def test_detect_license_file_issues_multiple_files(self):
package = DiscoveredPackage.objects.create(
project=self.project,
type="pypi",
name="test-package",
version="1.0.0",
declared_license_expression="MIT",
)

# Create multiple license files
license1 = CodebaseResource.objects.create(
project=self.project,
path="LICENSE",
name="LICENSE",
)
license2 = CodebaseResource.objects.create(
project=self.project,
path="COPYING",
name="COPYING",
)

package.codebase_resources.add(license1, license2)

issues = nixpkgs.detect_license_file_issues(package)

multiple_file_issues = [
i for i in issues if i["type"] == "multiple_license_files"
]
self.assertGreater(len(multiple_file_issues), 0)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for the license_file_without_declaration scenario in detect_license_file_issues (lines 617-625 in nixpkgs.py). The test suite should verify that when license files exist but no declared_license_expression is present, the appropriate warning-level issue is generated.

Copilot uses AI. Check for mistakes.

# Check if license is too generic or unclear
if package.declared_license_expression:
import re
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import statement is placed inside the function instead of at the module level. This violates Python best practices and PEP 8 guidelines. The 're' module should be imported at the top of the file.

Copilot uses AI. Check for mistakes.
Comment on lines +142 to +170
package.update(notes=new_notes)

# Get detected licenses for this package
detected_licenses = nixpkgs.get_detected_licenses_for_package(package)

# Try to suggest correction if declared license is wrong
if detected_licenses:
suggestion = nixpkgs.suggest_license_correction(
package,
detected_licenses
)
if suggestion and package.declared_license_expression:
if suggestion["suggested_license"] != package.declared_license_expression:
self.log(
f"Package {package}: suggested license "
f"'{suggestion['suggested_license']}' "
f"(confidence: {suggestion['confidence']})"
)
# Add suggestion to notes (idempotent) - use the already updated notes
suggestion_note = (
f"\nSuggested license: {suggestion['suggested_license']} "
f"(confidence: {suggestion['confidence']})\n"
f"Reason: {suggestion['reason']}"
)
# Refresh package from DB to get latest notes after previous update
package.refresh_from_db()
current_notes = package.notes or ""
if suggestion_note not in current_notes:
package.update(notes=current_notes + suggestion_note)
Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation has a potential race condition or stale data issue. After updating the package notes at line 142, the code then calls refresh_from_db() at line 167, but only if a suggestion exists. This means the second update at line 170 might be operating on stale data if the first update completed. The logic should either refresh before the second update or combine both updates into a single operation to avoid potential data inconsistency.

Copilot uses AI. Check for mistakes.
Comment on lines +134 to +140

sections = []
if current_notes:
sections.append(current_notes)
sections.append("\n" + header)
sections.extend(issue_messages)

Copy link

Copilot AI Jan 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code constructs multi-line strings by joining sections with newlines, but the logic is complex and hard to follow. At line 136, sections might be empty (if current_notes is empty), then line 138 adds "\n" + header (which starts with a newline), and lines 139 extends with issue_messages. This could result in unexpected leading newlines. Consider simplifying this logic or adding comments to clarify the intended format of the final notes string.

Suggested change
sections = []
if current_notes:
sections.append(current_notes)
sections.append("\n" + header)
sections.extend(issue_messages)
sections = []
if current_notes:
# Preserve existing notes, followed by a blank line before the header
sections.append(current_notes)
sections.append("")
sections.append(header)
sections.extend(issue_messages)

Copilot uses AI. Check for mistakes.
@Sahil-u07 Sahil-u07 force-pushed the nixpkgs-clarity-license-detection branch from ebe82ba to 79b3b19 Compare January 18, 2026 17:33
Critical fixes:
- Move 're' module import to top level (PEP 8 compliance)
- Fix license compatibility check to not split on operators (or/and/with)
  This was breaking semantic meaning of expressions like 'GPL-3.0-or-later'
- Add logging for invalid license expressions instead of silent failures
- Improve license file detection with extension validation
- Combine license file detection into single loop (performance)

Moderate fixes:
- Better error handling with debug logging
- More precise license file pattern matching
- Performance improvement by reducing redundant iterations

Signed-off-by: Sahil Lenka <sahillenka44@gmail.com>
@Sahil-u07 Sahil-u07 force-pushed the nixpkgs-clarity-license-detection branch from 79b3b19 to 77a62da Compare January 18, 2026 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant