Skip to content

Conversation

@danalec
Copy link

@danalec danalec commented Nov 26, 2025

  • Summary

    • Harden Docker runtime (digest pinning, non-root user), expand CI coverage to Node/Python, add baseline linting/static analysis across stacks, pin a ranged Java dependency, and fix versioning script path alignment. Improves security, reproducibility, and code quality without breaking APIs.
  • Changes

    • Docker
      • Switch to eclipse-temurin:17-jre-alpine pinned by digest for reproducibility (java/opendataloader-pdf-cli/Dockerfile:1).
      • Add non-root user with Alpine-compatible commands and switch to it (java/opendataloader-pdf-cli/Dockerfile:4).
    • CI Coverage
      • Keep Java JaCoCo upload (.github/workflows/coverage.yml:24–31).
      • Add Node LCOV via Vitest (node-coverage job present in .github/workflows/coverage.yml).
      • Add Python coverage XML via coverage.py (python-coverage job present in .github/workflows/coverage.yml).
    • Linting/Static Analysis
      • Node: ESLint with TypeScript rules; lint and lint:fix scripts (node/opendataloader-pdf/package.json:18–23), config at node/opendataloader-pdf/.eslintrc.json:1.
      • Python: Ruff and Black configured in python/opendataloader-pdf/pyproject.toml:5–16.
      • Java: Checkstyle and SpotBugs added to core and CLI (java/opendataloader-pdf-core/pom.xml:68–101, java/opendataloader-pdf-cli/pom.xml:46–79) with shared config (java/checkstyle.xml:1).
    • Build Script
      • Fix build-scripts/set_version.py to update python/opendataloader-pdf/setup.py using correct path and regex (build-scripts/set_version.py:7,19–25,33–43,45).
  • Rationale

    • Digest pinning and non-root user tighten supply-chain and runtime security.
    • Full coverage reporting for Node/Python increases visibility in this polyglot repo.
    • Baseline linters and static analysis catch issues earlier and standardize style.
    • Dependency pinning avoids variability from RC range builds.
    • Version script now matches the actual repo layout and release pipeline.
  • Compatibility

    • Java bytecode target remains 11 (java/pom.xml:50–51); running on JRE 17 is compatible.
    • No breaking CLI/API changes; workflows extend coverage only.
    • Docker base change reduces image size and modernizes runtime; entrypoint unchanged.
  • Security

    • Non-root user reduces blast radius.
    • Digest pinning ensures deterministic base image supply chain.
  • Breaking Changes

    • None.
  • Testing

    • CI coverage added for Node/Python; existing Java coverage remains.
    • No new unit tests added; coverage jobs exercise existing tests.
  • Documentation

    • Not required; behavior and interfaces unchanged.
  • Checklist

    • PR title follows Conventional Commits format
    • Documentation has been updated, if necessary
    • Examples have been added, if necessary
    • Tests have been added, if necessary

- Add Docker security best practices by running as non-root user
- Implement linting configurations for Java, Python, and Node projects
- Extend CI workflow to include coverage for Node and Python
- Add Maven checkstyle and spotbugs plugins for Java projects
- Update version management script to handle setup.py
Update from eclipse-temurin:11-jre to 17-jre-alpine with pinned SHA for security. Replace groupadd/useradd with Alpine-compatible addgroup/adduser commands.
@CLAassistant
Copy link

CLAassistant commented Nov 26, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@hnc-leebd hnc-leebd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Thanks a lot for your contribution. really appreciate it!
I fully agree with the direction and improvements you're proposing.

When you have a moment, could you please take a look at the build and test errors and fix them?
Once those are resolved, we should be good to move forward.

Thanks again!

- Update verapdf version range in pom.xml
- Improve Dockerfile permissions handling
- Add version file sync step in release workflow
- Simplify version updates in build steps
- Enhance coverage workflow with matrix testing
- Add ESLint and typechecking to Node.js coverage job
- Update pnpm-lock.yaml with new dependencies
Add OpenContainer image labels for better metadata and specify user group explicitly
Ensure proper cleanup of static resources and containers after PDF processing
to prevent memory leaks and allow sequential processing of multiple files.
Added integration test to verify resource cleanup behavior.
Add escapeHtmlText method to properly escape HTML special characters in generated output. This prevents potential XSS vulnerabilities when rendering user-provided content. The method handles &, <, >, ", ', and / characters, replacing them with their HTML entity equivalents.

Also includes a test case to verify proper escaping of script tags in heading text.
Add escapeMarkdownText method to properly handle special characters in markdown generation, preventing potential XSS and formatting issues. The method replaces characters like <, >, &, and markdown syntax characters with their escaped equivalents.
@danalec
Copy link
Author

danalec commented Nov 27, 2025

XSS hardening for HTML and Markdown outputs and ensures robust resource cleanup after each PDF processed.

Key Changes

  • Resource lifecycle

    • Wraps processFile in try/finally ; closes PDDocument and clears static containers.

    • Closes ContrastRatioConsumer and resets layout/thread-local state after each run.

    • Reference: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/DocumentProcessor.java:49

    • HTML escaping

    • Escapes user-controlled text for headings, paragraphs, captions, and list items.

    • Preserves generator-created tags; only text payloads are escaped.

    • References:

      • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java:156
      • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java:171
      • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java:218–229
      • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/html/HtmlGenerator.java:234
  • Markdown escaping

    • Escapes HTML entity characters and backslash-escapes Markdown control characters ( | * _ # [ ] ( ) !`) in user text.
    • Applied to list items, semantic text nodes, and paragraphs; headings leverage the same path via writeSemanticTextNode .
    • Reference: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java:148,165,206–208,257–279
      Verification
  • Multi-PDF integration test: Processes two PDFs sequentially, asserts PDDocument is null after each run, static containers reset, and files can be deleted (no Windows file locks).

    • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/DocumentProcessorMultiFileIT.java
  • HTML escaping unit test: Confirms <script>alert('x')</script> becomes <script>alert('x')</script> in generated HTML.

    • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/html/HtmlGeneratorEscapingTest.java
  • Markdown escaping unit test: Confirms HTML entity encoding plus Markdown punctuation escaping in output.

    • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/markdown/MarkdownGeneratorEscapingTest.java

@codecov
Copy link

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants