Skip to content

Conversation

@one-kash
Copy link

Summary

The isAscii() extension function in OkHostnameVerifier used length == utf8Size() to determine if a string was ASCII. This was incorrect because Okio's utf8Size() maps malformed surrogates (unpaired high/low surrogates) to the replacement character ? which is 1 byte in UTF-8.

This caused strings like "\uD800.com" (unpaired high surrogate) to incorrectly report as ASCII since:

  • length = 5 (one char for surrogate + 4 for ".com")
  • utf8Size() = 5 (surrogate mapped to '?' = 1 byte + 4 for ".com")

The fix replaces the comparison with a simple character iteration check: all { it.code <= 127 }

Changes

  • OkHostnameVerifier.kt: Changed isAscii() to iterate through characters and check code points
  • HostnameVerifierTest.kt: Added test for malformed surrogates

Test Plan

  • Added malformedSurrogatesAreNotAscii test that verifies unpaired high/low surrogates are rejected
  • All existing HostnameVerifierTest tests continue to pass

Fixes #6357

The previous implementation used `length == utf8Size()` to determine if
a string was ASCII. This was incorrect because Okio's utf8Size() maps
malformed surrogates (unpaired high/low surrogates) to the replacement
character `?` which is 1 byte in UTF-8.

This caused strings like "\uD800.com" (unpaired high surrogate) to
incorrectly report as ASCII since:
- length = 5 (one char for surrogate + 4 for ".com")
- utf8Size() = 5 (surrogate mapped to '?' = 1 byte + 4 for ".com")

The fix iterates through each character and checks if its code point
is within the ASCII range (0-127).

Fixes square#6357

/** Returns true if the [String] is ASCII encoded (0-127). */
private fun String.isAscii() = length == utf8Size().toInt()
private fun String.isAscii() = all { it.code <= 127 }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we want to avoid Iterable.all to avoid an allocation.

Comment on lines +821 to +827
// Unpaired high surrogate - should not match any hostname
assertThat(verifier.verify("\uD800.com", session)).isFalse()
assertThat(verifier.verify("foo\uD800.com", session)).isFalse()

// Unpaired low surrogate - should not match any hostname
assertThat(verifier.verify("\uDC00.com", session)).isFalse()
assertThat(verifier.verify("foo\uDC00.com", session)).isFalse()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would any of these have matched before the change? is there some case where length == utf8Size().toInt() would match because of offsetting issues, longer lower case, and malformed surrogate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

isAscii is wrong for malformed surrogates

2 participants