feat: add database output support for storing scan results #2363
base: dev
Conversation
Add support for storing httpx scan results directly to databases with both CLI flags and YAML config file options.

Supported databases:
- MongoDB
- PostgreSQL
- MySQL

Features:
- Batched writes for performance (configurable batch size)
- Auto-flush with configurable interval
- Individual columns for each Result field (not a JSON blob)
- Support for environment variable HTTPX_DB_CONNECTION_STRING
- Option to omit raw request/response data (-rdbor)

New CLI flags under OUTPUT group:
- -rdb, -result-db: enable database storage
- -rdbc, -result-db-config: path to YAML config file
- -rdbt, -result-db-type: database type (mongodb, postgres, mysql)
- -rdbcs, -result-db-conn: connection string
- -rdbn, -result-db-name: database name (default: httpx)
- -rdbtb, -result-db-table: table/collection name (default: results)
- -rdbbs, -result-db-batch-size: batch size (default: 100)
- -rdbor, -result-db-omit-raw: omit raw data

Closes #1973
Closes #2360
Closes #2361
Closes #2362
Walkthrough

Adds pluggable database result storage (MongoDB, PostgreSQL, MySQL) with CLI flags, config loading, a registry/factory, batched writer, schema management, and runner integration to persist scan results.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI
    participant Runner as httpx Runner
    participant Writer as DB Writer (callback)
    participant Batcher as Batcher Loop
    participant DB as Database (Mongo/Postgres/MySQL)
    CLI->>Runner: start with --result-db / config
    Runner->>Writer: NewWriter(ctx, cfg) / register OnResult & OnClose
    loop each scan result
        Runner->>Writer: OnResult(result)
        Writer->>Writer: enqueue to channel (non-blocking)
    end
    alt background flush triggers
        Batcher->>Writer: drain channel (batch)
        Batcher->>DB: InsertBatch(results)
        DB-->>Batcher: ok / error
        Batcher->>Writer: update counters / logs
    end
    CLI->>Runner: finish / interrupt
    Runner->>Writer: Close()
    Writer->>Batcher: cancel, flush remaining
    Batcher->>DB: final InsertBatch
    Writer->>DB: Close connection
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (4 passed)

📜 Recent review details: Configuration used: Organization UI, Review profile: CHILL, Plan: Pro
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
⏰ Context from checks skipped due to timeout of 90000ms (7). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
Actionable comments posted: 3
🤖 Fix all issues with AI Agents
In @internal/db/mysql.go:
- Around line 47-152: EnsureSchema currently injects m.cfg.TableName directly
into the CREATE TABLE SQL; validate or safely quote the identifier before
interpolation to prevent SQL injection (e.g., enforce a safe regex like
^[A-Za-z0-9_]+$ on m.cfg.TableName and/or wrap the identifier in backticks after
rejecting/escaping any backticks). Update mysqlDatabase.EnsureSchema to use the
validated/quoted table name variable when building the schema string (replace
direct use of m.cfg.TableName). Also add a matching index on url (e.g., add
`INDEX idx_url (url)`) to the CREATE TABLE statement for parity with the
Postgres schema.
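The validation-plus-quoting approach the review suggests can be sketched as follows; the helper name `quoteMySQLIdentifier` is hypothetical, and the actual fix in the PR may differ:

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe allows only alphanumerics and underscores, the safe subset
// suggested in the review for table names.
var identRe = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// quoteMySQLIdentifier rejects anything outside the safe charset and
// wraps the survivor in backticks, so the result can be interpolated
// into a CREATE TABLE statement without SQL injection risk.
func quoteMySQLIdentifier(name string) (string, error) {
	if !identRe.MatchString(name) {
		return "", fmt.Errorf("invalid table name: %q", name)
	}
	return "`" + name + "`", nil
}

func main() {
	q, err := quoteMySQLIdentifier("results")
	fmt.Println(q, err) // `results` <nil>

	_, err = quoteMySQLIdentifier("results`; DROP TABLE x")
	fmt.Println(err != nil) // true: injection attempt rejected
}
```

Because the regex already rejects backticks, escaping inside the quoted name is unnecessary; rejecting outright keeps the helper simple.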
In @internal/db/postgres.go:
- Around line 48-161: The SQL builds in the schema variable interpolate
p.cfg.TableName directly (and into index names like idx_%s_timestamp and later
INSERT queries), which risks SQL injection; update the code to quote identifiers
using pq.QuoteIdentifier(p.cfg.TableName) (or an equivalent safe
identifier-quoting helper) everywhere the table name or derived index names are
inserted (including schema, index names, and any INSERT/SELECT/DROP/ALTER
statements) so that the table and index identifiers are safely escaped before
passing into fmt.Sprintf.
In @internal/db/writer.go:
- Around line 130-137: The Close method races with sends from the callback
returned by GetWriterCallback: after calling w.cancel() but before close(w.data)
a resumed callback can see both w.ctx.Done() and w.data ready and attempt to
send into a closed channel, causing a panic; update GetWriterCallback (the
sending goroutine) to check w.closed (or use an atomic/boolean helper like
w.isClosed()) before selecting/send and bail out if closed, or instead change
Close to close w.data before/atomically with cancellation (or use a mutex to
synchronize) so that the callback never attempts a send after closure; modify
either Writer.Close, GetWriterCallback, or both to perform this
closed-check/synchronization using the existing w.closed, w.cancel, w.data, and
w.ctx symbols.
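The mutex-synchronized closed-check option can be sketched like this; the type and field names are simplified stand-ins for the real writer:

```go
package main

import (
	"fmt"
	"sync"
)

// writer demonstrates the pattern: a mutex guards both the closed flag
// and the channel send, so no callback can send after Close has closed
// the channel.
type writer struct {
	mu     sync.Mutex
	closed bool
	data   chan string
}

func newWriter() *writer {
	return &writer{data: make(chan string, 16)}
}

// enqueue is what the OnResult callback would call; it bails out
// instead of sending once the writer is closed, avoiding the
// send-on-closed-channel panic.
func (w *writer) enqueue(s string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.closed {
		return false
	}
	w.data <- s
	return true
}

// Close marks the writer closed and closes the channel under the same
// lock, so the two states can never be observed inconsistently.
func (w *writer) Close() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if !w.closed {
		w.closed = true
		close(w.data)
	}
}

func main() {
	w := newWriter()
	fmt.Println(w.enqueue("a")) // true
	w.Close()
	fmt.Println(w.enqueue("b")) // false: send after close is refused
}
```

An atomic boolean alone is not enough here, since the flag check and the send must be observed as one step; the mutex (or closing the channel only after cancellation is acknowledged) provides that atomicity.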
🧹 Nitpick comments (12)
internal/db/config.go (1)
39-49: Consider stricter validation for numeric fields.

The current validation only checks database type and connection string. `BatchSize` and `FlushInterval` are validated indirectly through `ApplyDefaults` (which replaces non-positive values with defaults). While this works, it means explicitly setting negative values gets silently corrected rather than rejected.
Consider adding explicit validation to reject negative values as invalid input, which would make configuration errors more apparent to users.
🔎 Example stricter validation
```diff
 func (c *Config) Validate() error {
 	if !c.Type.IsValid() {
 		return fmt.Errorf("invalid database type: %s (supported: %v)", c.Type, SupportedDatabases())
 	}

 	if c.ConnectionString == "" {
 		return fmt.Errorf("connection string is required")
 	}
+
+	if c.BatchSize < 0 {
+		return fmt.Errorf("batch size cannot be negative")
+	}
+
+	if c.FlushInterval < 0 {
+		return fmt.Errorf("flush interval cannot be negative")
+	}

 	return nil
 }
```

internal/db/mongodb.go (1)
115-127: Consider direct BSON marshaling for efficiency.

The JSON round-trip (marshal to JSON, then unmarshal to BSON) is simple and reliable but adds overhead. For scanning workloads with moderate throughput, this approach is acceptable and prioritizes maintainability.
If performance profiling reveals this as a bottleneck, consider using BSON tags on the runner.Result struct and marshaling directly to BSON.
internal/db/writer.go (2)
67-71: Consider also omitting `RawHeaders` when `omitRaw` is enabled.

The `omitRaw` option clears `Raw`, `Request`, and `ResponseBody`, but `RawHeaders` (which contains the raw HTTP response headers) is not cleared. If the intent is to reduce storage of raw HTTP data, this field should likely be included.

🔎 Proposed fix

```diff
 if w.omitRaw {
 	r.Raw = ""
 	r.Request = ""
 	r.ResponseBody = ""
+	r.RawHeaders = ""
 }
```
96-101: Failed batch data is discarded without retry.

When `InsertBatch` fails, the error is logged but the batch is cleared at line 103, permanently losing those results. For reliability, consider implementing a retry mechanism or dead-letter queue for failed batches.

internal/db/db.go (2)
22-29: `IsValid()` and `SupportedDatabases()` may diverge if new types are added.

Both functions enumerate the valid database types independently. If a new database type is added, both must be updated. Consider deriving one from the other to ensure consistency.
🔎 Proposed fix
```diff
+var supportedDatabases = []DatabaseType{MongoDB, PostgreSQL, MySQL}
+
 func (d DatabaseType) IsValid() bool {
-	switch d {
-	case MongoDB, PostgreSQL, MySQL:
-		return true
-	default:
-		return false
+	for _, db := range supportedDatabases {
+		if d == db {
+			return true
+		}
 	}
+	return false
 }

 func SupportedDatabases() []DatabaseType {
-	return []DatabaseType{MongoDB, PostgreSQL, MySQL}
+	return supportedDatabases
 }
```

Also applies to: 68-70
43-49: Registry map lacks synchronization, but safe given init-only writes.

The `registry` map is written to only during `init()` calls, which are guaranteed to run sequentially before `main()`. This is safe, but adding a comment documenting this assumption would help future maintainers.

runner/options.go (1)
680-801: No validation added for database options.

`ValidateOptions()` doesn't validate the new database-related flags. Consider adding validation to ensure:

- If `-rdb` is enabled, either `-rdbc` or (`-rdbt` + `-rdbcs`) must be provided
- Database type is one of the supported values
internal/db/postgres.go (2)
218-227: JSON marshaling errors are silently ignored.

All `json.Marshal` calls ignore the error return value. While marshaling struct types rarely fails, nil pointer dereferences or custom `MarshalJSON` implementations could fail. Consider at minimum logging errors or using a helper that returns `[]byte("null")` on failure.

🔎 Proposed helper pattern

```go
func mustMarshalJSON(v interface{}) []byte {
	data, err := json.Marshal(v)
	if err != nil {
		return []byte("null")
	}
	return data
}
```
171-252: Consider using COPY for better batch insert performance.

For PostgreSQL, using the `COPY` protocol (via `pq.CopyIn`) is significantly faster for bulk inserts compared to prepared statement execution in a loop. This could improve performance for large batch sizes.

internal/db/mysql.go (3)
56-60: VARCHAR length constraints may cause data truncation.

Several VARCHAR columns have fixed limits that may truncate real-world data:

- `host VARCHAR(255)`: most hostnames fit, but some edge cases exist
- `method VARCHAR(10)`: custom methods could exceed this
- `port VARCHAR(10)`: reasonable
- `scheme VARCHAR(10)`: reasonable

Consider using `TEXT` for `method` or increasing the limit to handle custom HTTP methods.
209-227: JSON marshaling errors silently ignored.

Same issue as the PostgreSQL implementation: all `json.Marshal` errors are discarded. Apply the same fix pattern as suggested for postgres.go.
162-252: Consider multi-value INSERT for better MySQL batch performance.

MySQL supports multi-value INSERT syntax (`INSERT INTO t VALUES (...), (...), ...`) which is faster than executing prepared statements in a loop. This could significantly improve batch insert performance.
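Such a statement can be assembled once per batch with `?` placeholders; the helper name and column list below are illustrative, and the real writer would flatten each row's values into an args slice in the same column order:

```go
package main

import (
	"fmt"
	"strings"
)

// buildMultiInsert builds a single multi-value INSERT statement with
// one "(?, ?, ...)" group per row, suitable for db.Exec with a
// flattened argument slice.
func buildMultiInsert(table string, cols []string, rows int) string {
	// One placeholder group per row, e.g. "(?, ?)".
	group := "(" + strings.TrimSuffix(strings.Repeat("?, ", len(cols)), ", ") + ")"
	groups := make([]string, rows)
	for i := range groups {
		groups[i] = group
	}
	return fmt.Sprintf("INSERT INTO `%s` (%s) VALUES %s",
		table, strings.Join(cols, ", "), strings.Join(groups, ", "))
}

func main() {
	fmt.Println(buildMultiInsert("results", []string{"url", "status_code"}, 2))
	// INSERT INTO `results` (url, status_code) VALUES (?, ?), (?, ?)
}
```

Note that MySQL limits a statement's total size via `max_allowed_packet`, so very large batches may still need to be split.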
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (10)
`README.md`, `cmd/httpx/httpx.go`, `go.mod`, `internal/db/config.go`, `internal/db/db.go`, `internal/db/mongodb.go`, `internal/db/mysql.go`, `internal/db/postgres.go`, `internal/db/writer.go`, `runner/options.go`
🧰 Additional context used
🧬 Code graph analysis (5)
- internal/db/writer.go (4)
  - internal/db/db.go (2): `Database` (31-41), `NewDatabase` (51-66)
  - internal/db/config.go (1): `Config` (23-37)
  - runner/types.go (1): `Result` (35-105)
  - runner/options.go (1): `OnResultCallback` (51-51)
- internal/db/postgres.go (5)
  - internal/db/db.go (4): `Register` (47-49), `PostgreSQL` (14-14), `Database` (31-41), `DatabaseType` (10-10)
  - internal/db/config.go (1): `Config` (23-37)
  - runner/types.go (2): `Result` (35-105), `Trace` (107-123)
  - common/httpx/csp.go (1): `CSPData` (25-28)
  - common/httpx/proto.go (1): `HTTP2` (8-8)
- internal/db/db.go (2)
  - runner/types.go (1): `Result` (35-105)
  - internal/db/config.go (1): `Config` (23-37)
- internal/db/mysql.go (5)
  - internal/db/db.go (4): `Register` (47-49), `MySQL` (15-15), `Database` (31-41), `DatabaseType` (10-10)
  - internal/db/config.go (1): `Config` (23-37)
  - runner/types.go (2): `Result` (35-105), `Trace` (107-123)
  - common/httpx/csp.go (1): `CSPData` (25-28)
  - common/httpx/proto.go (1): `HTTP2` (8-8)
- internal/db/config.go (1)
  - internal/db/db.go (2): `DatabaseType` (10-10), `SupportedDatabases` (68-70)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Functional Test (macOS-latest)
- GitHub Check: Functional Test (windows-latest)
- GitHub Check: Functional Test (ubuntu-latest)
- GitHub Check: Lint Test
- GitHub Check: Analyze (go)
- GitHub Check: release-test
🔇 Additional comments (19)
README.md (2)
100-166: LGTM - Documentation alignment improvements.

The cosmetic adjustments to spacing and alignment improve readability and consistency across the CLI help sections.

206-213: LGTM - Database output flags properly documented.

The new database-related flags (`-rdb`, `-rdbc`, `-rdbt`, etc.) are clearly documented with appropriate defaults and descriptions, aligning well with the PR objectives to add database persistence support.

internal/db/config.go (3)
11-21: LGTM - Well-chosen defaults.

The default values are sensible: a batch size of 100 balances throughput and memory, the 1-minute flush interval prevents data loss, and the naming conventions are clear.

69-91: LGTM - Correct configuration loading flow.

The function properly handles YAML parsing, environment variable fallback for connection strings, default application, and validation in the correct order. Error wrapping provides good context.

104-125: LGTM - Consistent CLI-to-config conversion.

The ToConfig method correctly mirrors the file-based loading logic, ensuring both configuration paths (file vs. CLI flags) go through the same validation and defaulting pipeline.
cmd/httpx/httpx.go (2)
68-69: LGTM - Correct integration placement.

The database output setup is called at the appropriate point in the initialization sequence, after asset upload setup and before runner creation, allowing it to configure the result callbacks properly.

151-208: LGTM - Correct callback chaining and error handling.

The implementation properly:
- Loads configuration from file or CLI options with appropriate error handling
- Chains callbacks to preserve existing handlers (asset upload, etc.)
- Orders operations correctly: existing OnResult callback executes first, writer closes before existing OnClose
- Uses Fatal for setup errors, which is appropriate at initialization time
internal/db/mongodb.go (3)
15-28: LGTM - Standard factory registration pattern.

The init-based registration integrates cleanly with the database registry, and the constructor properly defers connection establishment to the explicit Connect() call.

30-59: LGTM - Robust connection handling with appropriate timeouts.

The Connect method uses reasonable timeouts (10s for connection/selection, ping verification) and Close properly handles cleanup with a 5s timeout to prevent blocking during shutdown.

61-87: No action needed.

MongoDB's `CreateMany()` is idempotent: re-creating indexes that already exist with identical definitions is a no-op and succeeds without error. The index definitions are appropriate for the query patterns, and the implementation will work correctly on subsequent application runs.

go.mod (1)
55-61: Dependency versions are appropriate; no critical security concerns identified.

The versions in go.mod are current and safe:
- mysql (v1.9.3), lib/pq (v1.10.9): Latest stable releases. CVE-2025-24787 and CVE-2025-1094/4207 are in upstream components (WhoDB and PostgreSQL libpq C library respectively), not these Go drivers.
- mongo-driver (v1.17.6): On the maintained v1.x branch; v2.4 is available but v1.x remains actively supported.
- mapstructure/v2 (v2.4.0): Confirmed as the patched version that fixed the information-leak vulnerability (present in versions < 2.4.0).
- gocsv and publicsuffix-go: Latest versions with no reported security vulnerabilities.
internal/db/writer.go (2)
26-59: LGTM!

The `NewWriter` function properly initializes the database connection, ensures schema, and starts the background goroutine with appropriate lifecycle management.

146-148: LGTM!

Simple atomic counter accessor.
internal/db/db.go (2)
31-41: LGTM!

The `Database` interface is well-defined with clear lifecycle methods and batch insert support.

51-66: LGTM!

Good defensive validation before factory invocation.
runner/options.go (2)
354-361: LGTM!

New database configuration fields are appropriately added to the Options struct.

498-505: Environment variable `HTTPX_DB_CONNECTION_STRING` is properly wired.

The flag description for `-rdbcs` correctly mentions the environment variable. While the flag definition doesn't use `goflags.EnvVar`, the environment variable is properly handled as a fallback in `internal/db/config.go` (lines 81 and 115): if no connection string is provided via the flag, the code reads from `HTTPX_DB_CONNECTION_STRING` via `os.Getenv()`. This is a valid design pattern where the flag takes precedence and the environment variable serves as a fallback.

Likely an incorrect or invalid review comment.
internal/db/postgres.go (1)
26-38: LGTM!

The Connect implementation properly opens a connection and pings to verify connectivity.
internal/db/mysql.go (1)
26-38: LGTM!

The Connect implementation correctly opens and pings the MySQL database.
- Fix SQL injection in postgres.go using pq.QuoteIdentifier for table/index names
- Fix SQL injection in mysql.go using custom quoteIdentifier function
- Fix race condition in writer.go by checking closed state before channel send
- Add missing idx_url index in mysql.go for parity with PostgreSQL schema
- Include RawHeaders in omitRaw check for consistency
Mzack9999
left a comment
overall lgtm, some minor suggestions, plus what do you think about moving the queries to sqlc for easier maintenance later (moving the queries to an sqlc file and generating the glue code with `sqlc generate`)?
```go
flagSet.BoolVarP(&options.ResultDatabase, "result-db", "rdb", false, "store results in database"),
flagSet.StringVarP(&options.ResultDatabaseConfig, "result-db-config", "rdbc", "", "path to database config file"),
flagSet.StringVarP(&options.ResultDatabaseType, "result-db-type", "rdbt", "", "database type (mongodb, postgres, mysql)"),
flagSet.StringVarP(&options.ResultDatabaseConnStr, "result-db-conn", "rdbcs", "", "database connection string (env: HTTPX_DB_CONNECTION_STRING)"),
```
We can try to use goflags.StringVarEnv
Summary
Adds support for storing httpx scan results directly to databases (MongoDB, PostgreSQL, MySQL) with both CLI flags and YAML config file options.
Features
- Batched writes for performance (configurable batch size)
- Auto-flush with configurable interval
- Individual columns for each Result field (not a JSON blob)
- Support for the `HTTPX_DB_CONNECTION_STRING` environment variable
- Option to omit raw request/response data (`-rdbor`)

New CLI Flags (OUTPUT group)
- `-rdb, -result-db`: enable database storage
- `-rdbc, -result-db-config`: path to YAML config file
- `-rdbt, -result-db-type`: database type (mongodb, postgres, mysql)
- `-rdbcs, -result-db-conn`: connection string
- `-rdbn, -result-db-name`: database name (default: httpx)
- `-rdbtb, -result-db-table`: table/collection name (default: results)
- `-rdbbs, -result-db-batch-size`: batch size (default: 100)
- `-rdbor, -result-db-omit-raw`: omit raw data

Usage Examples
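A few invocation sketches based on the flags listed in this PR; the target list, credentials, and connection strings are placeholders:

```shell
# Store results in PostgreSQL via CLI flags (placeholder credentials)
httpx -l targets.txt -rdb -rdbt postgres \
  -rdbcs "postgres://httpx:secret@localhost:5432/httpx?sslmode=disable"

# Same, but driven by a YAML config file
httpx -l targets.txt -rdb -rdbc db-config.yaml

# MongoDB, with raw request/response data omitted
httpx -l targets.txt -rdb -rdbt mongodb \
  -rdbcs "mongodb://localhost:27017" -rdbor
```

The connection string can also be supplied via the `HTTPX_DB_CONNECTION_STRING` environment variable instead of `-rdbcs`.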
Example Config File (db-config.yaml)
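A minimal sketch of what the config file might look like; the exact key names are assumptions, since the config schema isn't reproduced in this excerpt:

```yaml
# db-config.yaml - hypothetical key names for illustration
type: postgres
connection_string: "postgres://httpx:secret@localhost:5432/httpx?sslmode=disable"
database_name: httpx
table_name: results
batch_size: 100
omit_raw: false
```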
Testing with Docker
1. Start Database Containers
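A throwaway compose file can stand up all three supported databases for local testing; image tags and credentials below are arbitrary choices:

```yaml
# docker-compose.yaml - disposable databases for local testing
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: httpx
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: httpx
    ports:
      - "5432:5432"
  mysql:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: secret
      MYSQL_DATABASE: httpx
    ports:
      - "3306:3306"
  mongo:
    image: mongo:7
    ports:
      - "27017:27017"
```

Run `docker compose up -d` and point the connection strings at localhost.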
2. Test httpx with Each Database
3. Verify Results in Database
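Rows can be inspected directly in each container; the container names match the compose sketch above, and the `url`/`status_code` column names are assumptions based on the per-field schema described in this PR:

```shell
# PostgreSQL
docker exec -it postgres psql -U httpx -d httpx \
  -c "SELECT url, status_code FROM results LIMIT 5;"

# MySQL
docker exec -it mysql mysql -uroot -psecret httpx \
  -e "SELECT url, status_code FROM results LIMIT 5;"

# MongoDB
docker exec -it mongo mongosh httpx --eval "db.results.find().limit(5)"
```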
4. Cleanup
Test plan
Closes #1973
Closes #2360
Closes #2361
Closes #2362
Summary by CodeRabbit
New Features
Documentation