Initial Icinga Notifications Source #998
Conversation
Force-pushed from 98939bd to 25677ca
I've changed quite a few things since your last push, so I'll summarize the changes here for you:
Also, all the individual commits are self-contained and include a relevant commit message with some details.
julianbrost left a comment:
Apart from the inline comments, there's also a more general question that still needs consideration: error handling, i.e. what happens if submitting the event to Icinga Notifications doesn't work. With the current implementation, it is just discarded and thereby lost. That would be solved if the submission were synchronous in the history pipeline, just as writing the history to the SQL database currently is. In that case, the event would only be deleted from Redis after it was both written to the database and sent to Icinga Notifications (on the other hand, if Icinga Notifications is unavailable, this would probably prevent history from being written).
This would also solve a related issue currently present in the implementation: once it's written to the Go channel, it's considered done. If Icinga DB is stopped, there may still be events in flight, and there seems to be nothing that would wait for them. So unless I'm missing something, even a clean restart of the Icinga DB process might result in lost events.
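To illustrate the point, here is a minimal sketch (all names are hypothetical, not the PR's actual API) of how in-flight events could be tracked with a sync.WaitGroup so a clean shutdown waits for them:

```go
package main

import (
	"fmt"
	"sync"
)

type event struct{ name string }

// submitter is a hypothetical fire-and-forget pipeline stage, extended
// with a WaitGroup so that shutdown can wait for in-flight events.
type submitter struct {
	events chan *event
	wg     sync.WaitGroup
}

func newSubmitter() *submitter {
	s := &submitter{events: make(chan *event, 16)}
	go s.worker()
	return s
}

// submit hands an event to the background worker. Without the WaitGroup,
// the channel write alone would already count as "done" and a restart
// could lose events still in flight.
func (s *submitter) submit(ev *event) {
	s.wg.Add(1)
	s.events <- ev
}

func (s *submitter) worker() {
	for ev := range s.events {
		fmt.Println("submitting", ev.name) // send to Icinga Notifications here
		s.wg.Done()
	}
}

// shutdown blocks until every accepted event was processed.
func (s *submitter) shutdown() {
	close(s.events)
	s.wg.Wait()
}

func main() {
	s := newSubmitter()
	s.submit(&event{name: "downtime_end"})
	s.shutdown()
}
```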
Force-pushed from 6dd9d96 to a199863
julianbrost left a comment:
General question on something I didn't think about before: the history is passed over multiple streams that are processed independently. What consideration did you put into this? At first glance, this sounds like a nasty source of race conditions to me if things happen out of order.
| case "downtime_end": | ||
| if h.HasBeenCancelled.Valid && h.HasBeenCancelled.Bool { | ||
| ev.Type = event.TypeDowntimeRemoved | ||
| ev.Message = "Downtime was cancelled" | ||
|
|
||
| if h.CancelledBy.Valid { | ||
| ev.Username = h.CancelledBy.String | ||
| } | ||
| } else { | ||
| ev.Type = event.TypeDowntimeEnd | ||
| ev.Message = "Downtime expired" | ||
| } |
Where's the unmute happening? And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?
> Where's the unmute happening?

Looks like you've added an ev.Mute = types.MakeBool(false) in case "downtime_end" to address this.
However, buildFlappingHistoryEvent() has the same issue:
icingadb/pkg/notifications/notifications.go, lines 295 to 319 in d675257:
```go
// buildFlappingHistoryEvent from a flapping history entry.
func (client *Client) buildFlappingHistoryEvent(ctx context.Context, h *v1history.FlappingHistory) (*event.Event, error) {
	ev, _, err := client.buildCommonEvent(ctx, h.HostId, h.ServiceId)
	if err != nil {
		return nil, errors.Wrapf(err, "cannot build event for %q,%q", h.HostId, h.ServiceId)
	}

	if h.PercentStateChangeEnd.Valid {
		ev.Type = event.TypeFlappingEnd
		ev.Message = fmt.Sprintf(
			"Checkable stopped flapping (Current flapping value %.2f%% < low threshold %.2f%%)",
			h.PercentStateChangeEnd.Float64, h.FlappingThresholdLow)
	} else if h.PercentStateChangeStart.Valid {
		ev.Type = event.TypeFlappingStart
		ev.Message = fmt.Sprintf(
			"Checkable started flapping (Current flapping value %.2f%% > high threshold %.2f%%)",
			h.PercentStateChangeStart.Float64, h.FlappingThresholdHigh)
		ev.Mute = types.MakeBool(true)
		ev.MuteReason = "Checkable is flapping"
	} else {
		return nil, errors.New("flapping history entry has neither percent_state_change_start nor percent_state_change_end")
	}

	return ev, nil
}
```
> And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?
That part seems unaddressed though. That's a more fundamental issue and I don't see anything in the implementation that would handle nested muting by downtimes and flapping in any way (nor do I see an easy way to add it). Am I missing something here? (Or is this basically a limitation of getting the events this way?)
> However, buildFlappingHistoryEvent() has the same issue:

Yes, I have added an unmute there as well.
Furthermore, the current acknowledgment handling has no muting or unmuting at all, while the old implementation had conditional muting/unmuting. So far, I have just added it unconditionally, but this brings us straight to the next topic.
> And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?
>
> That part seems unaddressed though. That's a more fundamental issue and I don't see anything in the implementation that would handle nested muting by downtimes and flapping in any way (nor do I see an easy way to add it). Am I missing something here? (Or is this basically a limitation of getting the events this way?)
I am a bit unsure how to handle this. With this "stateless" pipeline approach, I don't see an easy way to address it. In theory, I could somehow dig most of the information out of Redis, but this wouldn't be pretty and could get expensive.
Force-pushed from ed7b0d1 to 0ab4f90
Note: After talking with @lippserd, we realized that the custom vars are required for permission filtering in Web. I will check how to add the custom vars from Icinga DB in a "quick fix" manner.
Force-pushed from 48248d9 to 214bfe2
Icinga/icingadb-web#1289 now prepares this as filter:

```json
{
    "version": 1,
    "config": {
        "type": "all",
        "filter": "host.name=docker-master"
    },
    "queries": {
        "host": {
            "query": "SELECT (1) FROM host WHERE ((host.id = ?) AND (host.environment_id = ?)) AND (host.name = ?) LIMIT 1",
            "parameters": [":host_id", ":environment_id", "docker-master"]
        },
        "service": {
            "query": "SELECT (1) FROM service LEFT JOIN host service_host ON service_host.id = service.host_id WHERE ((service.id = ?) AND (service.environment_id = ?)) AND (service_host.name = ?) LIMIT 1",
            "parameters": [":service_id", ":environment_id", "docker-master"]
        }
    }
}
```
The rules are no longer just plain SQL queries, but now have their own JSON format, introduced by Icinga DB Web. This format is now supported by Client.evaluateRulesForObject. - Icinga/icingadb-web#1289 - #998 (comment)
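For illustration, a minimal sketch of Go structs matching the JSON format shown above; the type names are assumptions and the PR's actual types may differ:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rule mirrors the JSON filter format shown above (illustrative names).
type Rule struct {
	Version int                  `json:"version"`
	Config  json.RawMessage      `json:"config"` // only relevant for Icinga DB Web, ignored here
	Queries map[string]RuleQuery `json:"queries"`
}

// RuleQuery holds one SQL existence check plus its parameters, which may
// be strings, integers, or anything else to PHP's liking.
type RuleQuery struct {
	Query      string `json:"query"`
	Parameters []any  `json:"parameters"`
}

func main() {
	raw := `{
		"version": 1,
		"config": {"type": "all", "filter": "host.name=docker-master"},
		"queries": {
			"host": {
				"query": "SELECT (1) FROM host WHERE ((host.id = ?) AND (host.environment_id = ?)) AND (host.name = ?) LIMIT 1",
				"parameters": [":host_id", ":environment_id", "docker-master"]
			}
		}
	}`

	var r Rule
	if err := json.Unmarshal([]byte(raw), &r); err != nil {
		panic(err)
	}
	fmt.Println(r.Queries["host"].Query)
}
```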
I thought about this again and noticed another limitation: transferring custom variables 1:1 might be enough if users want to match hosts with host vars and services with service vars, but the following is not possible:
It's not possible to re-use existing restrictions by simply copying them over, either because of missing data (case 1) or because of false positives (case 2). Case 1 can be fixed by just transferring all host vars to all of their services as well. We are already dealing with huge amounts of duplicated rows, though… So there is still no simple way to re-use existing restrictions, but it is at least somewhat more usable than just a 1:1 mapping.
Refactor telemetry.Stats to allow custom names. This enables dynamic callback names for the Redis history sync, used by Icinga Notifications.
- Don't validate notifications config in a background Goroutine. - Clip pipeline slice to avoid reusing capability twice. - Rework notification Client.buildCommonEvent and depending methods. - Resubmit events after updating rules in one go. - Simplify Client.fetchHostServiceName based on Julian's suggestion. Co-Authored-By: Julian Brost <julian.brost@icinga.com>
After reintroducing Event.ExtraTags in the IGL and Icinga Notifications, Icinga DB populates events with their custom variables. At the moment, the required customvars are fetched from Redis for each event. Due to the Redis schema, at least one HGETALL with manual filtering is required. This might be a good candidate for further caching, and cache invalidation.
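A minimal sketch of that fetch-and-filter pattern, assuming a hypothetical key name and value layout (the actual Icinga DB Redis schema differs in detail); the point is that a full HGETALL is needed and the filtering happens client-side:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6380"})

	// Fetch the whole hash; Redis offers no server-side filter here.
	all, err := rdb.HGetAll(ctx, "icinga:customvar").Result()
	if err != nil {
		panic(err)
	}

	// Manual, client-side filtering over every entry.
	for id, value := range all {
		if strings.Contains(value, `"name":"os"`) {
			fmt.Println(id, value)
		}
	}
}
```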
The StreamSorter was added to the history package, allowing it to collect messages from multiple Redis streams, sort them based on the timestamp in the stream ID, and eject them back in order. This is used for the callback stage required by Icinga Notifications, where an ordered stream is required. Despite my best intentions, it felt like I had created an Erlang.
Rework the prior custom variable fetching code to no longer fetch everything from Redis in a loop, but to send SQL queries for custom variables instead. In addition, service objects now contain both the service and host custom variables, prefixed with "host.vars." or "service.vars.".
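A minimal sketch of the prefixing described above; the function name and signature are illustrative, not the PR's:

```go
package notifications

// prefixedVars merges host and service custom variables into one tag map,
// disambiguated by the "host.vars." and "service.vars." prefixes.
func prefixedVars(hostVars, serviceVars map[string]string) map[string]string {
	tags := make(map[string]string, len(hostVars)+len(serviceVars))
	for name, value := range hostVars {
		tags["host.vars."+name] = value
	}
	for name, value := range serviceVars {
		tags["service.vars."+name] = value
	}
	return tags
}
```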
Refactor multiple variables into a common struct to ease handling.
- Ignore the "config" part of the JSON struct which is only relevant for Icinga DB Web. - Remove unnecessary string conversions. - Small code changes/improvements.
Effectively reverting cf4bd92 and passing a pointer to the relevant com.Counter to the history sync.
With parseRedisStreamId being in the "hot path", the quite simple regular expression was replaced with an even simpler string split. However, as network operations precede and follow this, the benefit of this optimization might be questionable.
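A minimal sketch of the string-split approach, with an illustrative name and signature rather than the PR's actual parseRedisStreamId: a Redis stream ID such as "1526919030474-55" is just "&lt;unix-ms&gt;-&lt;sequence&gt;", so a single strings.Cut can replace the regular expression.

```go
package notifications

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStreamID splits a Redis stream ID into its millisecond timestamp
// and sequence number without a regular expression.
func parseStreamID(id string) (ms int64, seq uint64, err error) {
	msPart, seqPart, ok := strings.Cut(id, "-")
	if !ok {
		return 0, 0, fmt.Errorf("invalid stream ID %q", id)
	}
	if ms, err = strconv.ParseInt(msPart, 10, 64); err != nil {
		return 0, 0, fmt.Errorf("invalid stream ID %q: %w", id, err)
	}
	if seq, err = strconv.ParseUint(seqPart, 10, 64); err != nil {
		return 0, 0, fmt.Errorf("invalid stream ID %q: %w", id, err)
	}
	return ms, seq, nil
}
```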
- Store the streamSorterSubmission submission time as a time.Time instead of a nanosecond timestamp, comparing via time.Time's monotonic clock.
- Replace the time-based buckets in StreamSorter.submissionWorker with a heap to be pushed to and popped from; however, only submissions of a certain age are forwarded (see the sketch below). This reduces complexity quite a bit.
- Reduce the complexity of StreamSorter.queueWorker by checking for new queue events at the loop start, getting rid of unnecessary channel signals.
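A minimal sketch of such a time-keyed min-heap using container/heap; the type and field names are assumptions, not the PR's actual implementation:

```go
package notifications

import (
	"container/heap"
	"time"
)

// submission is an illustrative stand-in for streamSorterSubmission.
type submission struct {
	receivedAt time.Time // time.Time keeps the monotonic clock reading
	payload    string
}

// submissionHeap is a min-heap ordered by submission time, replacing
// time-based buckets with plain push and pop operations.
type submissionHeap []*submission

func (h submissionHeap) Len() int           { return len(h) }
func (h submissionHeap) Less(i, j int) bool { return h[i].receivedAt.Before(h[j].receivedAt) }
func (h submissionHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *submissionHeap) Push(x any)        { *h = append(*h, x.(*submission)) }

func (h *submissionHeap) Pop() any {
	old := *h
	s := old[len(old)-1]
	*h = old[:len(old)-1]
	return s
}

// popOlderThan forwards only submissions of a certain age, leaving newer
// ones on the heap. The heap must be maintained via heap.Push beforehand.
func popOlderThan(h *submissionHeap, minAge time.Duration) []*submission {
	var ready []*submission
	for h.Len() > 0 && time.Since((*h)[0].receivedAt) >= minAge {
		ready = append(ready, heap.Pop(h).(*submission))
	}
	return ready
}
```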
Introduce the StreamSorter.CloseOutput method to remove all submissions for a certain output channel from both workers before closing the channel. The motivation behind this change is to have a single point where the output channel is closed, ensuring no submissions are sent into an already closed channel.
Seems to have been forgotten in e475a5e.
Besides minor changes, the custom variables are now fetched from customvar_flat, i.e., in their flattened format.
Besides some other small cleanups, the streamSorterSubmissions slice type now references pointers.
…ementation

This commit is pretty much an overhaul of the implementation to allow for a more straight-forward way to close the output channel. The main changes are:

- StreamSorter now provides a method PipelineFunc that can directly be used in a history sync pipeline. This allows StreamSorter to handle the in + out stream pair internally, so that it closes out after in was closed and all messages from it were passed to out (see the sketch below).
- The two worker goroutines were combined into a single one and the secondary queue was removed. All pending messages remain in the heap and are only removed from the heap when they are about to be passed to the callback.
- The worker now handles all operations (send and close) on the output stream.
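A minimal sketch of that pipeline-function shape, where the single worker goroutine alone sends on, and finally closes, the output channel; names and signature are illustrative, not the PR's:

```go
package notifications

import "context"

// Message stands in for a Redis stream message; the real types differ.
type Message struct{ ID string }

// pipelineFunc handles the in/out pair internally: the worker goroutine is
// the only sender on out, so it can safely close out once in is drained.
func pipelineFunc(ctx context.Context, in <-chan Message) <-chan Message {
	out := make(chan Message)
	go func() {
		defer close(out) // out closes only after in was fully consumed
		for msg := range in {
			// Real code would buffer msg in the heap and emit it once it
			// is old enough; here it is passed through directly.
			select {
			case out <- msg:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}
```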
After Julian reworked big parts of the StreamSorter for the better, I went over the code multiple times and added a few comments for parts I had to read twice. Within the tests, there might be a data race when zaptest is used after the test's context is done. Since there were a few log messages potentially occurring after the test's end, a guard was added to ensure no verbose log messages are being produced if the context is done.
The parameters can not only be strings, but anything to PHP's liking; in one example, an integer was observed. Since Parameters is converted to an []any later anyway, this is no real change in behavior.
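A short demonstration of why []any absorbs this, assuming plain encoding/json; note that JSON numbers arrive as float64 in an any value:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Parameters may mix types; []any accepts strings and numbers alike.
	var params []any
	raw := `[":host_id", 42, "docker-master"]`
	if err := json.Unmarshal([]byte(raw), &params); err != nil {
		panic(err)
	}
	for _, p := range params {
		fmt.Printf("%v (%T)\n", p, p) // 42 prints as "42 (float64)"
	}
}
```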
The whole StreamSorter logic is only required for Icinga Notifications. Thus, the implementation was moved from the history package to the notifications package, removing some unnecessary generalizations on the way. This results in big changes in the notifications package, while other modules are mostly unaffected.
Change the message for TypeAcknowledgementCleared events to a more obvious one.
Populate the Event's Mute field for muting and unmuting for flapping events and acknowledgements.
Allow configurable timeouts for the StreamSorter, so the tests can set them to a fraction of their default. Now the tests finish in three seconds instead of three minutes. While doing so, another race condition with the test logging was uncovered. Since this race results from a closing test context and test logger, there was not much to do, and I decided to just drop the logging message, which was only used for tests anyway.
Remove the generic information from 01-About.md and add a warning to 03-Configuration.md that this feature might not be stable.
Drop the "-source" suffix from the configuration option. Furhtermore, with the latest IGL release 0.8.1[0], the "api-base-url" is renamed to just "url". [0]: Icinga/icinga-go-library#168
Force-pushed from fea8170 to 327862c
This is the first version to use Icinga DB as an event source for Icinga Notifications. If configured accordingly, Icinga DB forwards events crafted from the Redis pipeline to the Icinga Notifications API.
This required a small refactoring of the history synchronization to allow hooking into the Redis stream. Afterwards, the newly introduced notifications package handles the rest.
Note: As part of this architectural change, Icinga Notifications offers filters to be evaluated by Icinga DB. At the moment, these are SQL queries being executed on the Icinga DB relational database. Either consider both Icinga DB and Icinga Notifications to be part of the same trust domain or consider the security implications.
Furthermore, this requires a change in Icinga Notifications as well; it will not work with the current version 0.1.1.
Other PRs are