increase chan size to fix oom event lost #386
Open

ningmingxiao wants to merge 1 commit into containerd:main from ningmingxiao:dev_04
why 16..
It is an empirical value; I tested it many times. I'm not sure. @mikebrow
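For illustration only (not code from this PR): when the sender uses a non-blocking `select`/`default` send, any events that arrive in a burst after the buffer fills are dropped, so a larger buffer absorbs larger bursts. The helper name and event payloads below are hypothetical.

```go
package main

import "fmt"

// lostUnderBurst counts how many events a non-blocking send drops when
// `burst` events arrive before the receiver reads anything. This mirrors
// the select/default "drop if full" pattern; it is a sketch, not the
// shim's actual code.
func lostUnderBurst(bufSize, burst int) int {
	ch := make(chan string, bufSize)
	lost := 0
	for i := 0; i < burst; i++ {
		select {
		case ch <- "oom":
		default:
			lost++ // buffer full, event dropped
		}
	}
	return lost
}

func main() {
	fmt.Println(lostUnderBurst(1, 8))  // size-1 buffer drops 7 of 8
	fmt.Println(lostUnderBurst(16, 8)) // size-16 buffer drops none
}
```

A bigger buffer only widens the window; if bursts can exceed the buffer, events can still be lost, which is why the discussion below moves to fixing the race itself.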
or we can try
@fuweid when you moved from unbuffered to buffered (size 1) in #374, did you consider a buffer size > 1? Are there any issues to consider w.r.t. ordering events, or closing the channel before it's emptied, etc.?
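As background on the ordering/close concerns: a Go buffered channel is FIFO, and closing it does not discard values already buffered; a receiver can still drain them after the close. A small self-contained check (hypothetical function and payload names):

```go
package main

import "fmt"

// drainAfterClose shows that values already buffered in a channel survive
// close() and come out in FIFO order.
func drainAfterClose() []string {
	ch := make(chan string, 4)
	ch <- "oom"
	ch <- "exit"
	close(ch) // closing does not drop the two buffered values
	var out []string
	for v := range ch { // range drains the buffer, then stops
		out = append(out, v)
	}
	return out
}

func main() {
	fmt.Println(drainAfterClose()) // [oom exit]
}
```

So a buffer > 1 is safe on those two counts; the open question is only whether the producers themselves send in the right order.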
I think the lost event is not caused by the channel side; it's a race condition on the shim side.
The shim sets GOMAXPROCS=4, and in the CI action critest runs 8 cases in parallel. So, ideally, it will run 8 pods at the same time on a 4-CPU-core node. If the shim gets the exit event before we read the OOM event, we can lose the event.
I am thinking we should drain the select-oom-event goroutine before sending the exit event on the shim side. Let me do some performance or density tests for that. I will update it later.
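A rough sketch of the drain idea, assuming a `sync.WaitGroup` (the actual shim may use a different mechanism): the exit event is only published after the goroutine watching OOM events has returned, so a pending OOM cannot be raced past. Event payloads and the helper name are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// publishWithDrain sketches draining the OOM watcher before sending the
// task exit event, so the OOM event always lands first.
func publishWithDrain() []string {
	events := make(chan string, 2)

	var oomWatcher sync.WaitGroup
	oomWatcher.Add(1)
	go func() {
		defer oomWatcher.Done()
		// stand-in for reading memory.events one last time
		events <- "oom"
	}()

	oomWatcher.Wait() // drain: wait for the OOM watcher to finish
	events <- "exit"  // only now publish the task exit event
	close(events)

	var out []string
	for ev := range events {
		out = append(out, ev)
	}
	return out
}

func main() {
	fmt.Println(publishWithDrain()) // always [oom exit]
}
```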
fair.. there may be another window where we get the task exit event on the CRI side first, then a request for status, and report status as exited with an error reason.. then receive the OOM event and update the container exit reason.. then if they ask again we give status with the OOM reason? The two events use up two checkpoints, both protected by the same container store global mutex, and because of this lock and the storage locks there is a "tight" window for the racing OOM test.
Thinking we might want to check if there is an exit reason queued up first before reporting status while it's exiting.. same for the generateAndSendContainerEvent() to the kubelet at the bottom of the task exit.. if we don't get the OOM event first, we have a window where we report the task exit with no reason.
worse.. when we get the container status:
if it has exited but the reason is nil.. we invent an exit reason if there is a non-zero exit code
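The fallback described above can be sketched as follows. This is a simplification, not the CRI plugin's exact code, and the reason strings are assumptions:

```go
package main

import "fmt"

// exitReason sketches the fallback: if a container exited with no
// recorded reason, invent one from the exit code. "Completed"/"Error"
// mirror common CRI reason strings but are assumptions here.
func exitReason(recorded string, exitCode int) string {
	if recorded != "" {
		return recorded // e.g. "OOMKilled" already set by the OOM event
	}
	if exitCode != 0 {
		return "Error" // invented reason for a non-zero exit code
	}
	return "Completed"
}

func main() {
	fmt.Println(exitReason("", 137))          // Error (OOM event arrived too late)
	fmt.Println(exitReason("OOMKilled", 137)) // OOMKilled
}
```

This is exactly why a late OOM event matters: the first status query after exit can permanently look like a plain "Error" to the caller.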
I added some logs and found that if I increase the size there is less chance of losing the OOM event. The OOM event really happened, but containerd failed to catch it.
The final time I read from memory.events is