
Conversation

@oxzi (Member) commented Jul 29, 2025

This is the first version to use Icinga DB as an event source for Icinga Notifications. If configured accordingly, Icinga DB forwards events crafted from the Redis pipeline to the Icinga Notifications API.

This required a small refactoring of the history synchronization to allow hooking into the Redis stream. Afterwards, the newly introduced notifications package handles the rest.

Note: As part of this architectural change, Icinga Notifications offers filters to be evaluated by Icinga DB. At the moment, these are SQL queries being executed on the Icinga DB relational database. Either consider both Icinga DB and Icinga Notifications to be part of the same trust domain or consider the security implications.

Furthermore, this requires changes on the Icinga Notifications side as well; it will not work with the current version 0.1.1.


Other PRs are

@oxzi oxzi added this to the 1.5.0 milestone Jul 29, 2025
@oxzi oxzi added enhancement New feature or request area/notifications labels Jul 29, 2025
@cla-bot cla-bot bot added the cla/signed label Jul 29, 2025
@oxzi oxzi marked this pull request as draft July 29, 2025 14:45
@yhabteab yhabteab force-pushed the icinga-notifications-init branch 2 times, most recently from 98939bd to 25677ca on August 8, 2025 13:18
@yhabteab (Member) commented Aug 8, 2025

I've changed quite a few things since your last push, so I'll summarize the changes here for you:

  • You previously used the any type to represent the history entries streamed from the history pipelines, but since we know they are going to be one of the v1/history types, I've replaced any with our database.Entity type, which is the common type for all our v1 database models. I've pushed this change directly into your commit, so you won't find an extra commit for it.
  • Instead of having to maintain duplicates of the Icinga Notifications Event, Severity, etc. types, I've taken the liberty of moving them into our Icinga Go Library [1], so that they can be used by all Icinga Notifications sources (ours or external ones) built in Go. Additionally, I've introduced a common Icinga Notifications client there that can be used to interact with the Icinga Notifications API in a uniform way. Thus, some of the methods you previously implemented on the Source type have been moved to the new client, with some minor adjustments to make them more generic and reusable. 92b4e4a is the commit that makes use of that new client and also removes the now obsolete types and methods from the pkg/notifications package.
  • I know it's only for the time being, but you previously limited the query parameters used for the SQL queries of the event rules config ObjectFilter to {host,service,environment}_id. Since the v1/history types provide numerous other fields that can be used for filtering, I've removed this limitation: the whole entity is now passed to the sqlx query builder, so any field can be used for filtering as long as it is supported by the concrete entity. You'll find this change in commit 4568b93.
  • Since the v1 entities don't have access to the actual Host or Service names, you previously performed SQL queries against the hosts and services tables to get them. Because these database queries were triggered for every incoming history entry, I found that a bit inefficient, so I've changed it to retrieve the names from Redis via the HGET command instead (a sketch follows this list), which requires neither SQL parsing nor a round trip to the database. I believe this will bring a significant performance improvement, especially in a big Icinga environment with numerous objects. If you don't see it that way, you can also revert it, and we'll discuss it further when I'm back. Performance-wise, there's still room for improvement though: when starting Icinga 2 with a lot of objects, Icinga Notifications still takes a while to actually receive a state change event, 1-3m after Icinga DB Web shows it in the history view (e.g. when you submit a process-check-result with OK state, the respective incident in Icinga Notifications won't be closed until 1-3m later). The commit that implements this change is b01e61c.
  • Previously, when Icinga Notifications rejected the process event request with an outdated rules version, you posted the current event in the background in a go s.Submit() manner, so that it could be re-processed by the worker and re-evaluated against the newly updated rules. However, since the worker's input channel is buffered, this can mess up the order of the events when there are already other events waiting in the queue. So, I've changed this to directly re-evaluate the event using a magic but simple goto statement (see the second sketch below). The commit that implements this change is 55753b5.
  • Finally, I've dropped the rulesMutex mutex that was used to protect the rules map, since it is no longer needed. Maybe you've planned to parallelize the event submission at some point, but at the moment there's no concurrent access to it, so I've dropped it entirely with 3549849 for the time being.
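
For illustration, here's a minimal sketch of such a name lookup via HGET, assuming the Icinga DB Redis schema where config objects live in hashes like icinga:host, keyed by object ID and storing their attributes as JSON; the function name and field layout are assumptions, not the actual implementation:

```go
package main

import (
	"context"
	"encoding/json"

	"github.com/redis/go-redis/v9"
)

// fetchHostName looks up a host's name directly from the Redis config hash,
// avoiding an SQL round trip per history entry.
func fetchHostName(ctx context.Context, rdb *redis.Client, hostID string) (string, error) {
	// HGET icinga:host <id> returns the JSON-serialized host attributes.
	data, err := rdb.HGet(ctx, "icinga:host", hostID).Result()
	if err != nil {
		return "", err
	}

	var attrs struct {
		Name string `json:"name"`
	}
	if err := json.Unmarshal([]byte(data), &attrs); err != nil {
		return "", err
	}
	return attrs.Name, nil
}
```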

Also, all the individual commits are self-contained and include relevant commit messages with some details about the changes, so you can always refer to them for the justifications.
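
And a minimal sketch of the goto-based re-evaluation from the list above, with assumed names (submit, refreshRules, and errRulesOutdated are hypothetical): when the API rejects an event due to an outdated rules version, the rules are refreshed and the very same event is evaluated again in place, so queued events keep their order:

```go
// processEvent submits a single event; on a rules-version mismatch it
// refreshes the cached rules and retries the same event in place.
func (client *Client) processEvent(ctx context.Context, ev *event.Event) error {
	var err error

retry:
	err = client.submit(ctx, ev) // hypothetical: POST to the process-event endpoint
	if errors.Is(err, errRulesOutdated) { // hypothetical sentinel error
		if err := client.refreshRules(ctx); err != nil { // hypothetical
			return err
		}
		goto retry
	}
	return err
}
```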

Footnotes

  1. https://github.com/Icinga/icinga-go-library/pull/145

@julianbrost (Contributor) left a comment

Apart from the inline comments, there's also a more general question that still needs consideration: error handling, in the sense of what happens if submitting the event to Icinga Notifications doesn't work. With the current implementation, it is just discarded and thereby lost. That would be solved if the submission were synchronous in the history pipeline (sketched below), just as writing the history to the SQL database currently is. In that case, the event would only be deleted from Redis after it was both written to the database and sent to Icinga Notifications (on the other hand, if Icinga Notifications is unavailable, this would probably prevent history from being written).

This would also solve a related issue currently present in the implementation: once it's written to the Go channel, it's considered done. If Icinga DB is stopped, there may still be events in flight, and there seems to be nothing that would wait for them. So unless I'm missing something, a clean restart of the Icinga DB process might result in lost events.
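
A minimal sketch of the synchronous variant suggested above, with assumed types and names: the stage only forwards an entity downstream (and thereby towards deletion from Redis) once it was successfully handed to Icinga Notifications:

```go
// notificationsStage is a hypothetical history pipeline stage: entities flow
// from in to out, and each one is only forwarded after the corresponding
// event was accepted by Icinga Notifications, so a failed submission stalls
// the pipeline instead of silently dropping the event.
func notificationsStage(ctx context.Context, client *Client, in <-chan database.Entity, out chan<- database.Entity) error {
	defer close(out)

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case entity, ok := <-in:
			if !ok {
				return nil
			}
			if err := client.SubmitEvent(ctx, entity); err != nil { // hypothetical method
				return err
			}
			select {
			case out <- entity:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
}
```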

@oxzi oxzi force-pushed the icinga-notifications-init branch 4 times, most recently from 6dd9d96 to a199863 on September 9, 2025 10:10
@oxzi oxzi requested a review from julianbrost September 9, 2025 13:02
oxzi added a commit to Icinga/icinga2 that referenced this pull request Sep 25, 2025
@julianbrost (Contributor) left a comment

General question on something I didn't think about before: the history is passed over multiple streams that are processed independently. What consideration did you put into this? At first glance, this sounds like a nasty source of race conditions to me if things happen out of order.

Comment on lines 238 to 286

```go
case "downtime_end":
	if h.HasBeenCancelled.Valid && h.HasBeenCancelled.Bool {
		ev.Type = event.TypeDowntimeRemoved
		ev.Message = "Downtime was cancelled"

		if h.CancelledBy.Valid {
			ev.Username = h.CancelledBy.String
		}
	} else {
		ev.Type = event.TypeDowntimeEnd
		ev.Message = "Downtime expired"
	}
```
Contributor left a comment

Where's the unmute happening? And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?

Contributor left a comment

Where's the unmute happening?

Looks like you've added an ev.Mute = types.MakeBool(false) in case "downtime_end" to address this.

However, buildFlappingHistoryEvent() has the same issue:

```go
// buildFlappingHistoryEvent from a flapping history entry.
func (client *Client) buildFlappingHistoryEvent(ctx context.Context, h *v1history.FlappingHistory) (*event.Event, error) {
	ev, _, err := client.buildCommonEvent(ctx, h.HostId, h.ServiceId)
	if err != nil {
		return nil, errors.Wrapf(err, "cannot build event for %q,%q", h.HostId, h.ServiceId)
	}

	if h.PercentStateChangeEnd.Valid {
		ev.Type = event.TypeFlappingEnd
		ev.Message = fmt.Sprintf(
			"Checkable stopped flapping (Current flapping value %.2f%% < low threshold %.2f%%)",
			h.PercentStateChangeEnd.Float64, h.FlappingThresholdLow)
	} else if h.PercentStateChangeStart.Valid {
		ev.Type = event.TypeFlappingStart
		ev.Message = fmt.Sprintf(
			"Checkable started flapping (Current flapping value %.2f%% > high threshold %.2f%%)",
			h.PercentStateChangeStart.Float64, h.FlappingThresholdHigh)
		ev.Mute = types.MakeBool(true)
		ev.MuteReason = "Checkable is flapping"
	} else {
		return nil, errors.New("flapping history entry has neither percent_state_change_start nor percent_state_change_end")
	}

	return ev, nil
}
```

And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?

That part seems unaddressed though. That's a more fundamental issue and I don't see anything in the implementation that would handle nested muting by downtimes and flapping in any way (nor do I see an easy way to add it). Am I missing something here? (Or is this basically a limitation of getting the events this way?)

@oxzi (Member, Author) left a comment

However, buildFlappingHistoryEvent() has the same issue:

Yes, I have added an unmute there as well.

Furthermore, the current acknowledgment handling has no muting or unmuting at all, while there was conditional muting/unmuting in the old implementation. So far, I have just always added it, but this brings us straight to the next topic.

And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?

That part seems unaddressed though. That's a more fundamental issue and I don't see anything in the implementation that would handle nested muting by downtimes and flapping in any way (nor do I see an easy way to add it). Am I missing something here? (Or is this basically a limitation of getting the events this way?)

I am a bit unsure how to handle this. With this "stateless" pipeline approach, I don't see an easy way to address it. In theory, I could somehow dig most of the information out of Redis, but this wouldn't be beautiful and would potentially be expensive.

@oxzi oxzi force-pushed the icinga-notifications-init branch from ed7b0d1 to 0ab4f90 on October 21, 2025 14:40
@oxzi (Member, Author) commented Oct 22, 2025

Note: After talking with @lippserd, we realized that the custom vars are required for permission filtering in Web. I will check how to add the custom vars from Icinga DB in a "quick fix" manner.

@oxzi oxzi force-pushed the icinga-notifications-init branch from 48248d9 to 214bfe2 on October 22, 2025 16:15
@nilmerg (Member) commented Oct 24, 2025

Icinga/icingadb-web#1289 now prepares this as a filter:

```json
{
  "version": 1,
  "config": {
    "type": "all",
    "filter": "host.name=docker-master"
  },
  "queries": {
    "host": {
      "query": "SELECT (1) FROM host WHERE ((host.id = ?) AND (host.environment_id = ?)) AND (host.name = ?) LIMIT 1",
      "parameters": [":host_id", ":environment_id", "docker-master"]
    },
    "service": {
      "query": "SELECT (1) FROM service LEFT JOIN host service_host ON service_host.id = service.host_id WHERE ((service.id = ?) AND (service.environment_id = ?)) AND (service_host.name = ?) LIMIT 1",
      "parameters": [":service_id", ":environment_id", "docker-master"]
    }
  }
}
```

config.type may also be just "host" or "service"; queries then only contains the single respective key.
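
For illustration, a minimal sketch of how such a prepared query could be evaluated; the type and function names are assumptions, not the actual Client.evaluateRulesForObject implementation. Parameters starting with ":" are resolved from the object's columns, everything else (like "docker-master" above) is passed through as a literal:

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"strings"

	"github.com/jmoiron/sqlx"
)

// RuleQuery mirrors one entry of the "queries" object shown above.
type RuleQuery struct {
	Query      string `json:"query"`
	Parameters []any  `json:"parameters"`
}

// matchesObject runs the rule's SELECT (1) ... LIMIT 1 query; a returned row
// means the object matches the rule's filter.
func matchesObject(ctx context.Context, db *sqlx.DB, q RuleQuery, columns map[string]any) (bool, error) {
	args := make([]any, 0, len(q.Parameters))
	for _, p := range q.Parameters {
		// ":host_id" and friends are placeholders for the object's columns.
		if s, ok := p.(string); ok && strings.HasPrefix(s, ":") {
			args = append(args, columns[strings.TrimPrefix(s, ":")])
			continue
		}
		args = append(args, p) // literal filter value, e.g. "docker-master"
	}

	var one int
	err := db.QueryRowxContext(ctx, q.Query, args...).Scan(&one)
	if errors.Is(err, sql.ErrNoRows) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```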

oxzi added a commit that referenced this pull request Oct 27, 2025
The rules are no longer just plain SQL queries, but now have their own JSON format, introduced by Icinga DB Web. This format is now supported by Client.evaluateRulesForObject.

- Icinga/icingadb-web#1289
- #998 (comment)
@nilmerg (Member) commented Oct 27, 2025

Note: After talking with @lippserd, we realized that the custom vars are required for permission filtering in Web. I will check how to add the custom vars from Icinga DB in a "quick fix" manner.

I thought about this again and noticed another limitation: transferring custom variables 1:1 might be enough if users want to match hosts with host vars and services with service vars, but the following is not possible:

  1. Matching services with host vars.
     This is a common thing: declaring a host variable and using it in apply rules to generate services makes total sense to also use as a restriction. However, with the 1:1 mapping, services only get their own variables assigned, not those of their host.
  2. Matching only hosts or services where their services' variables or their host's variables (respectively) match.
     Consider host.vars.os=Linux. Using this to filter services only matches services whose host has set os=Linux.
     Also: service.vars.port=443. Using this to filter hosts only matches hosts that have a service with port=443 set.
     With the 1:1 mapping and no way to differentiate the object type, object.vars.os=Linux also matches services that are not related to a host with this variable but have their own variable set, due to the first case.

It's not possible to re-use existing restrictions by simply copying them over, either because of missing data (Case 1) or because of false positives (Case 2).

Case 1 can be fixed by just transferring all host vars to all of their services as well. We are already dealing with huge amounts of duplicated rows, though…
Case 2 can only be partially fixed by introducing an extra ID tag type and then doing object.tag.type=host&object.vars.os=Linux. This is even more important when considering the fix for Case 1. Only partially, because there is no way to express relationship checks; we cannot support matching objects based on another object's configuration. (We might, by introducing a dynamic join allowing objects to be joined where a subset of ID tags match, but I don't want to see the query plan for this monstrosity if multiple extra tags are part of a restriction.)

But still, there is no simple way to re-use existing restrictions. At least this is somewhat more usable than just a 1:1 mapping.

oxzi and others added 25 commits November 17, 2025 09:20
Refactor telemetry.Stats to allow custom names. This enables dynamic callback names for the Redis history sync, used by Icinga Notifications.
- Don't validate the notifications config in a background Goroutine.
- Clip the pipeline slice to avoid reusing its capacity twice.
- Rework the notifications Client.buildCommonEvent and dependent methods.
- Resubmit events after updating rules in one go.
- Simplify Client.fetchHostServiceName based on Julian's suggestion.

Co-Authored-By: Julian Brost <julian.brost@icinga.com>
After reintroducing Event.ExtraTags in the IGL and Icinga Notifications, Icinga DB populates events with their custom variables.

At the moment, the required customvars are fetched from Redis for each event. Due to the Redis schema, at least one HGETALL with manual filtering is required. This might be a good candidate for further caching, and cache invalidation.
The rules are no longer just plain SQL queries, but now have their own JSON format, introduced by Icinga DB Web. This format is now supported by Client.evaluateRulesForObject.

- Icinga/icingadb-web#1289
- #998 (comment)
The StreamSorter was added to history, allowing messages from multiple Redis streams to be collected, sorted based on the timestamp in the Stream ID, and ejected back.

This is used for the callback stage required by Icinga Notifications; in the Notifications context, an ordered stream is required.

Despite my best intentions, it felt like I had created an Erlang.
Rework the prior custom variable fetching code to no longer fetch everything in a looping fashion from Redis, but to send SQL queries for the custom variables instead.

In addition, service objects now contain both the service and host custom variables, prefixed by "host.vars." or "service.vars.".
Refactor multiple variables into a common struct to ease handling.
- Ignore the "config" part of the JSON struct which is only relevant for
  Icinga DB Web.
- Remove unnecessary string conversions.
- Small code changes/improvements.
Effectively reverting cf4bd92 and
passing a pointer to the relevant com.Counter to the history sync.
With parseRedisStreamId being on the hot path, the quite simple regular expression was exchanged for an even simpler string split. However, as network operations precede and follow this, the benefit of this optimization might be questionable.
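
A minimal sketch of the split-based parsing, assuming a Stream ID always has the "<ms>-<seq>" shape (e.g. "1526919030474-55"); the actual signature may differ:

```go
// parseRedisStreamId splits a Redis Stream ID into its millisecond timestamp
// and sequence number without a regular expression.
func parseRedisStreamId(id string) (ms, seq uint64, err error) {
	msPart, seqPart, ok := strings.Cut(id, "-")
	if !ok {
		return 0, 0, fmt.Errorf("invalid Redis Stream ID %q", id)
	}
	if ms, err = strconv.ParseUint(msPart, 10, 64); err != nil {
		return 0, 0, fmt.Errorf("invalid timestamp in Stream ID %q: %w", id, err)
	}
	if seq, err = strconv.ParseUint(seqPart, 10, 64); err != nil {
		return 0, 0, fmt.Errorf("invalid sequence in Stream ID %q: %w", id, err)
	}
	return ms, seq, nil
}
```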
- Store the streamSorterSubmission submission time as a time.Time instead of
  a nanosecond timestamp, comparing via time.Time's monotonic clock.
- Replace the time-based buckets in StreamSorter.submissionWorker with a heap
  to be pushed and popped; only submissions of a certain age are forwarded
  (see the sketch below). Reduces complexity quite a bit.
- Reduce the complexity of StreamSorter.queueWorker by getting rid of
  unnecessary channel signals, checking for new queue events to process at
  the loop start.
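
A minimal sketch of the heap-based ordering with assumed types: submissions form a min-heap ordered by their Stream ID timestamp, and only entries older than a grace period are popped, giving slower streams time to deliver earlier messages:

```go
package main

import "container/heap"

// submission is a hypothetical, stripped-down stand-in for
// streamSorterSubmission; payload and output channel are omitted.
type submission struct {
	streamMs uint64 // millisecond part of the Redis Stream ID
}

// submissionHeap is a min-heap ordered by the Stream ID timestamp,
// implementing heap.Interface from container/heap.
type submissionHeap []*submission

func (h submissionHeap) Len() int           { return len(h) }
func (h submissionHeap) Less(i, j int) bool { return h[i].streamMs < h[j].streamMs }
func (h submissionHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *submissionHeap) Push(x any)        { *h = append(*h, x.(*submission)) }
func (h *submissionHeap) Pop() any {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// drainReady forwards, in timestamp order, all submissions that are older
// than the grace period.
func drainReady(h *submissionHeap, nowMs, graceMs uint64, forward func(*submission)) {
	for h.Len() > 0 && (*h)[0].streamMs+graceMs <= nowMs {
		forward(heap.Pop(h).(*submission))
	}
}
```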
Introduce the StreamSorter.CloseOutput method to remove all submissions
for a certain output channel from both workers before closing the
channel.

The motivation behind this change is to have a single point where the
output channel is closed while no submissions are being sent into an
already closed channel.
Next to minor changes, the custom variables are now fetched from customvar_flat, i.e. in their flattened format.
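
As an illustration, a hedged sketch of what such a query could look like against the customvar_flat and host_customvar tables of the Icinga DB schema; the exact query used by the implementation may differ:

```go
// hostVarsQuery fetches a host's flattened custom variables, to be exposed
// to Icinga Notifications with a "host.vars." prefix (services analogously
// via service_customvar).
const hostVarsQuery = `
SELECT customvar_flat.flatname, customvar_flat.flatvalue
FROM customvar_flat
INNER JOIN host_customvar ON host_customvar.customvar_id = customvar_flat.customvar_id
WHERE host_customvar.host_id = ?`
```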
Next to some other small cleanups, the streamSorterSubmissions slice
type now references pointers.
…ementation

This commit is pretty much an overhaul of the implementation to allow for a more straightforward way to close the output channel. The main changes are:

- StreamSorter now provides a method PipelineFunc that can directly be used in
  a history sync pipeline. This allows StreamSorter to handle the in + out
  stream pair internally, so that it closes out after in was closed and all
  messages from it were passed to out.
- The two worker goroutines were combined into a single one and the secondary
  queue was removed. All pending messages remain in the heap and are only
  removed when they are about to be passed to the callback.
- The worker now handles all operations (send and close) on the output stream.
After Julian reworked big parts of the StreamSorter for the better, I went over the code multiple times and added a few comments for parts I had to read twice.

Within the tests, there might be a data race when zaptest is used after the test's context is done. Since a few log messages could potentially occur after the test's end, a guard was added to ensure no verbose log messages are produced once the context is done.
The parameters can not only be strings, but anything to PHP's liking. In one example, an integer was observed. Since Parameters is converted to an []any later anyway, this is no real change in behavior.
The whole StreamSorter logic is only required for Icinga Notifications. Thus, the implementation was moved from the history package to the notifications package, removing some unnecessary generalizations along the way. This results in big changes in the notifications package, while other modules are mostly unaffected.
Change the message for TypeAcknowledgementCleared events to a more
obvious one.
Populate the Event's Mute field for muting and unmuting for flapping
events and acknowledgements.
Allow configurable timeouts for the StreamSorter, so they can be set to a fraction of their defaults in the tests. Now the tests finish in three seconds instead of three minutes.

While doing so, another race condition with the test logging was unveiled. Since this race results from a closing test context and test logger, there was not much to do, and I decided to just drop the logging message, which was only used for tests anyway.
Remove the generic information from 01-About.md and add a warning to
03-Configuration.md that this feature might not be stable.
Drop the "-source" suffix from the configuration option. Furhtermore,
with the latest IGL release 0.8.1[0], the "api-base-url" is renamed to
just "url".

[0]: Icinga/icinga-go-library#168
@oxzi oxzi force-pushed the icinga-notifications-init branch from fea8170 to 327862c on November 17, 2025 08:22
@oxzi oxzi marked this pull request as ready for review November 17, 2025 08:22
@julianbrost julianbrost merged commit ff1c49e into main Nov 17, 2025
31 checks passed
@julianbrost julianbrost deleted the icinga-notifications-init branch November 17, 2025 12:46