Skip to content

Verify and document queue-proxy pod statistics security#16366

Merged
knative-prow[bot] merged 2 commits intoknative:mainfrom
sonikaarora:verify-queue-proxy-pod-statistics
Feb 4, 2026
Merged

Verify and document queue-proxy pod statistics security#16366
knative-prow[bot] merged 2 commits intoknative:mainfrom
sonikaarora:verify-queue-proxy-pod-statistics

Conversation

@sonikaarora
Copy link
Contributor

@sonikaarora sonikaarora commented Jan 29, 2026

Summary

This PR verifies and documents that queue-proxy instances can only report statistics for their own pod and cannot spoof metrics for other pods.

Changes

  • Added TestProtobufStatsReporterPodNameImmutable: Verifies that the pod name remains immutable across multiple stat reports
  • Added TestProtobufStatsReporterMultiplePods: Verifies that separate queue-proxy instances maintain independent pod identities
  • Enhanced documentation: Added security property documentation to ProtobufStatsReporter explaining immutable pod identity binding

Verification

The implementation was inspected and verified to ensure:

  1. Pod name is set once at construction from Kubernetes environment (env.ServingPod)
  2. The podName field is private and has no setter methods
  3. All statistics reports use this immutable pod identity
  4. The autoscaler correctly attributes metrics to the appropriate pod and revision

Test Results

All tests pass including the new security verification tests:

=== RUN   TestProtobufStatsReporterPodNameImmutable
--- PASS: TestProtobufStatsReporterPodNameImmutable (0.00s)
=== RUN   TestProtobufStatsReporterMultiplePods
--- PASS: TestProtobufStatsReporterMultiplePods (0.00s)

Coverage: 97.1% of statements in pkg/queue

Backward Compatibility

✅ No breaking changes
✅ All existing tests pass
✅ No changes to public APIs

Related: #11015

This change verifies and documents that queue-proxy instances can only
report statistics for their own pod and cannot spoof metrics for other pods.

Changes:
- Added TestProtobufStatsReporterPodNameImmutable to verify pod name
  immutability across multiple stat reports
- Added TestProtobufStatsReporterMultiplePods to verify separate pod
  identities are maintained
- Enhanced ProtobufStatsReporter documentation explaining the security
  property of immutable pod identity binding

The podName field is set once at construction from the Kubernetes
environment and cannot be modified, ensuring correct metric attribution
for autoscaling decisions.
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2026
@knative-prow knative-prow bot requested review from gauron99, linkvt and skonto January 29, 2026 06:00
@knative-prow knative-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 29, 2026
@knative-prow
Copy link

knative-prow bot commented Jan 29, 2026

Hi @sonikaarora. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sonikaarora sonikaarora marked this pull request as ready for review January 29, 2026 06:03
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2026
@knative-prow knative-prow bot requested a review from dprotaso January 29, 2026 06:04
@linkvt
Copy link
Contributor

linkvt commented Jan 29, 2026

Hi,
thanks for your PR. I just checked it and the referenced issue and don't think that this was meant in the issue description.

If I understand your tests right they basically only check that golang works correctly and doesn't mix properties of multiple instances of the same struct - IMO no need to test this 😀 .

Considering that the issue kind is documentation and considering this quote from the issue I think that this PR is not the correct approach to solve it:

Inspect code and verify that when autoscaler calls the queue proxy, all the statistics from the queue-proxy are associated with the Revision and Pod associated with the queue proxy.

@linkvt
Copy link
Contributor

linkvt commented Jan 29, 2026

/hold

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2026
@sonikaarora sonikaarora marked this pull request as draft January 29, 2026 17:58
@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 29, 2026
@sonikaarora sonikaarora force-pushed the verify-queue-proxy-pod-statistics branch from d3b0f24 to 85fc1c7 Compare January 31, 2026 03:53
@sonikaarora sonikaarora marked this pull request as ready for review January 31, 2026 03:54
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2026
@knative-prow knative-prow bot requested a review from dsimansk January 31, 2026 03:54
@sonikaarora
Copy link
Contributor Author

Hi @linkvt,

Thanks for the feedback! You're absolutely right - my initial tests were too basic.

I've updated the PR to address your concern. The key change is a new integration test: TestScrapeOnlyScrapesPodsForCorrectRevision (in pkg/autoscaler/metrics/stats_scraper_test.go).

This test verifies the complete autoscaler → queue-proxy flow:

  • Creates pods for two different revisions (revision-a and revision-b)
  • Configures the autoscaler scraper to target only revision-a
  • Verifies that only revision-a pods are scraped (by checking HTTP requests made)
  • Confirms that revision-b pods are never contacted

This directly addresses the issue requirement: "when autoscaler calls the queue proxy, all the statistics from the queue-proxy are associated with the Revision and Pod".

The original unit tests remain (they verify pod-level immutability) but are now complemented by this end-to-end test showing revision isolation via Kubernetes labels and PodAccessor filtering.

Would appreciate another look when you have time. Thanks!

@dprotaso
Copy link
Member

dprotaso commented Feb 2, 2026

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 2, 2026
@dprotaso
Copy link
Member

dprotaso commented Feb 2, 2026

/assign @evankanderson for input

I read the title of the issue and thought this was about the autoscaler verifying the source of metrics are coming from a specific pod/revision. But seeing the label about documentation and the body of the issue I'm unsure about the requirements.

@knative-prow
Copy link

knative-prow bot commented Feb 2, 2026

@dprotaso: GitHub didn't allow me to assign the following users: for, input.

Note that only knative members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

/assign @evankanderson for input

I read the title of the issue and thought this was about the autoscaler verifying the source of metrics are coming from a specific pod/revision. But seeing the label about documentation and the body of the issue I'm unsure about the requirements.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


// TestProtobufStatsReporterMultiplePods verifies that different queue-proxy instances
// maintain separate pod identities and cannot report for each other.
func TestProtobufStatsReporterMultiplePods(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop this test. It's not really doing anything more than TestProtobufStatsReporterPodNameImmutable

// that statistics are correctly attributed to the appropriate revision.
//
// Related to: https://github.com/knative/serving/issues/11015
func TestScrapeOnlyScrapesPodsForCorrectRevision(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test scraper only really supports scraping one revision's pods - so we shouldn't seed stats for revision B

// Create a scraper specifically for revision-a
// The fake client will return stats for all 4 pods, but the scraper
// should only scrape the 2 pods belonging to revision-a
allStats := append(statsRevA, statsRevB...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stats here shouldn't include statsRevB because the test scraper returns stats in the order of invocation

For example - if I were to call scrape 4 times then you'd get stats rev b being returned in the scraper. This would mean the mock is not behaving like the real scraper.

@evankanderson
Copy link
Member

Yes, the original issue is to verify that when the autoscaler is reading metrics from a specific pod. In particular, https://github.com/knative/serving/blob/main/pkg/autoscaler/metrics/stats_scraper.go#L326 passes a stat object deserialized from the remote "queue-proxy" at https://github.com/knative/serving/blob/main/pkg/autoscaler/metrics/http_scrape_client.go#L80.

We can't actually tell whether it's a queue proxy or something malicious; we do prevent a DoS by limiting the buffer size. We should also prevent it from replying for a different Knative service, but I don't think anyone spent the time to analyze the call stack to see if this was a reasonable concern. We seem to have a serviceScraper for each Knative revision (if I read the code correctly), which seems to be using a Knative-created k8s Service to locate the pods. It looks like the Stat message may not include namespace / revision information, so this may be a moot attack vector.

@evankanderson
Copy link
Member

Yes, the original issue is to verify that when the autoscaler is reading metrics from a specific pod. In particular, https://github.com/knative/serving/blob/main/pkg/autoscaler/metrics/stats_scraper.go#L326 passes a stat object deserialized from the remote "queue-proxy" at https://github.com/knative/serving/blob/main/pkg/autoscaler/metrics/http_scrape_client.go#L80.

We can't actually tell whether it's a queue proxy or something malicious; we do prevent a DoS by limiting the buffer size. We should also prevent it from replying for a different Knative service, but I don't think anyone spent the time to analyze the call stack to see if this was a reasonable concern. We seem to have a serviceScraper for each Knative revision (if I read the code correctly), which seems to be using a Knative-created k8s Service to locate the pods. It looks like the Stat message may not include namespace / revision information, so this may be a moot attack vector.

I've read the code. It appears that the autoscaler works as follows:

  1. MetricCollector implements Collector for interfacing with the k8s watcher interface, and MetricClient for interfacing with the rest of the scaling system.
  2. Internally, MetricCollector keeps a map of NamespacedName revisions to stats collectors. This map is updated by watching the Kubernetes apiserver for updates to the autoscaling.internal.knative.dev/v1alpha1/Metric custom resource.
  3. For each Metric resource, the autoscaler constructs a separate watcher on the pods specified by the k8s Service reference in the Metric (same-namespace). The watcher tracks lifecycle of a StatsScraper and metric aggregations associated with the Metric.
  4. Stat reports are never accumulated outside a Metric custom resource.
  5. The rest of the autoscaling infrastructure requests scaling values based on the NamespacedName from item 2; if there is no corresponding Metric object, an ErrNotCollecting will be returned.

Based on this structure, the most damage an attacker with editor on all types in a namespace (the normal edit role does not have Metric permissions) could do would be:

  • Create additional Metric objects -- add some additional work to the autoscaler.
  • Change a Metric's linked k8s Service to point to a non-queue-proxy pod (either in their own namespace or someone else's).
    • For a within-namespace pod, you could report additional load and cause additional scaling
    • For an outside-namespace pod, you might also be able to intuit some measure of load / scale. To do this, you would need to be able to enumerate the Pod IPs of the target namespace to fill them into the Service's Endpoints. But if you can already enumerate the Pod IPs of the target namespace, you already have some scaling information...

In particular, it would not be possible to use access to the Metric or Stat responses from an attacker controlled pod to change the scaling for a Revision which is outside the attacker's namespace.

@evankanderson
Copy link
Member

That's not to say that these might not be useful changes, but the "investigate" part of the original issue really was "investigate" before proposing fixes.

@dprotaso
Copy link
Member

dprotaso commented Feb 2, 2026

FWIW - I'm good to merge the PR (with the requested changes) though I'd say it doesn't fix the target issue.

@sonikaarora sonikaarora force-pushed the verify-queue-proxy-pod-statistics branch from 85fc1c7 to cc0a491 Compare February 4, 2026 03:26
@codecov
Copy link

codecov bot commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.22%. Comparing base (4004491) to head (cc0a491).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16366      +/-   ##
==========================================
+ Coverage   80.17%   80.22%   +0.05%     
==========================================
  Files         216      216              
  Lines       13440    13440              
==========================================
+ Hits        10775    10782       +7     
+ Misses       2300     2294       -6     
+ Partials      365      364       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sonikaarora
Copy link
Contributor Author

sonikaarora commented Feb 4, 2026

FWIW - I'm good to merge the PR (with the requested changes) though I'd say it doesn't fix the target issue.

@dprotaso Thanks for the review! I've made both changes:

  1. Removed TestProtobufStatsReporterMultiplePods
  2. Fixed the test to only seed statsRevA in the mock client

Updated commit is pushed. Please take another look when you get a chance.

@dprotaso
Copy link
Member

dprotaso commented Feb 4, 2026

/retest

@dprotaso
Copy link
Member

dprotaso commented Feb 4, 2026

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 4, 2026
@dprotaso
Copy link
Member

dprotaso commented Feb 4, 2026

/hold cancel

@knative-prow
Copy link

knative-prow bot commented Feb 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, sonikaarora

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 4, 2026
@knative-prow knative-prow bot merged commit 2cd582b into knative:main Feb 4, 2026
152 of 155 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants