-
Notifications
You must be signed in to change notification settings - Fork 709
Description
I was reading over implicit context caching here https://docs.cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview, and my understanding is that the implicit caching would work with partial hits as long as the prefix is fixed. However, I was only able to retrieve non-zero cached_content_token_count when the requests were exactly the same.
Using a slightly different code from the notebook in the docs gives me (at least looking at the usage metadata) no cache hit at all
def main():
client = Client(
vertexai=True,
project=GCP_PROJECT,
location="us-central1",
)
MODEL_ID = "gemini-2.5-flash"
NUM_ATTEMPTS = 3
texts = [
"Write a short and engaging blog post based on this image.",
"Describe this image with three words.",
"What is this image about?",
]
for i in range(NUM_ATTEMPTS):
response = client.models.generate_content(
model=MODEL_ID,
contents=[
types.Part.from_uri(
file_uri="https://storage.googleapis.com/cloud-samples-data/generative-ai/image/a-man-and-a-dog.png",
mime_type="image/png",
),
texts[i],
],
)
cached_token_count = response.usage_metadata.cached_content_token_count or 0
print(f"#{i + 1} Attempt")
print(f"Input tokens: {response.usage_metadata.prompt_token_count}")
print(f"Cached tokens: {cached_token_count}")
print(f"Output tokens: {response.usage_metadata.candidates_token_count}")
print(f"Total tokens: {response.usage_metadata.total_token_count}")
print()
if cached_token_count > 0:
print(response.usage_metadata.cache_tokens_details)Results in
#1 Attempt
Input tokens: 2334
Cached tokens: 0
Output tokens: 316
Total tokens: 4012
#2 Attempt
Input tokens: 2329
Cached tokens: 0
Output tokens: 6
Total tokens: 3208
#3 Attempt
Input tokens: 2328
Cached tokens: 0
Output tokens: 259
Total tokens: 3527
I was expecting at least the image tokens to be cached.
Are partial token count hits available?
Also, is there a difference for caching system instructions (config parameter) versus contents?
The way we've been working with is having the system instructions as fixed instructions defined as a types.GenerateContentConfig and the variable text as types.Part(text=text)