
Conversation

@oxzi (Member) commented Jul 29, 2025

This is the first version to use Icinga DB as an event source for Icinga Notifications. If configured accordingly, Icinga DB forwards events crafted from the Redis pipeline to the Icinga Notifications API.

This required a small refactoring of the history synchronization to allow hooking into the Redis stream. Afterwards, the newly introduced notifications package handles the rest.

Note: As part of this architectural change, Icinga Notifications offers filters to be evaluated by Icinga DB. At the moment, these are SQL queries being executed on the Icinga DB relational database. Either consider both Icinga DB and Icinga Notifications to be part of the same trust domain or consider the security implications.

Furthermore, this requires changes on the Icinga Notifications side as well; it will not work with the current version 0.1.1.


Other PRs are

@oxzi oxzi added this to the 1.5.0 milestone Jul 29, 2025
@oxzi oxzi added enhancement New feature or request area/notifications labels Jul 29, 2025
@cla-bot cla-bot bot added the cla/signed label Jul 29, 2025
@oxzi oxzi marked this pull request as draft July 29, 2025 14:45
@yhabteab yhabteab force-pushed the icinga-notifications-init branch 2 times, most recently from 98939bd to 25677ca on August 8, 2025 13:18
@yhabteab (Member) commented Aug 8, 2025

I've changed quite a few things since your last push, so I'll summarize the changes here for you:

  • You previously used the any type to represent the history entries streamed from the history pipelines, but since we know they are going to be one of the v1/history types, I've replaced any with our database.Entity type, which is the common type for all our v1 database models. I've pushed this change directly into your commit, so you won't find an extra commit for it.
  • Instead of having to maintain duplicates of the Icinga Notifications Event, Severity, etc. types, I've taken the liberty of moving them into our Icinga Go Library [1], so that they can be used by all Icinga Notifications sources (ours or external ones) built in Go. Additionally, I've introduced a common Icinga Notifications client there that can be used to interact with the Icinga Notifications API in a uniform way. Thus, some of the methods you previously implemented on the Source type have been moved to the new client, with some minor adjustments to make them more generic and reusable. 92b4e4a is the commit that makes use of that new client and also removes the now obsolete types and methods from the pkg/notifications package.
  • I know it's only for the time being, but you previously limited the query parameters used for the SQL queries of the event rules config ObjectFilter to {host,service,environment}_id. Since the v1/history types provide numerous other fields that can be used for filtering, I've removed this limitation: the whole entity is now passed to the sqlx query builder, so any field can be used for filtering as long as it is supported by the concrete entity. You'll find this change in commit 4568b93.
  • Since the v1 entities don't have access to the actual Host or Service names, you previously performed SQL queries against the hosts and services tables to get them. Because these database queries were triggered for every incoming history entry, I found that a bit inefficient, so I've changed it to retrieve the names from Redis via the HGET command instead (a sketch follows this list), which requires neither SQL parsing nor a round trip to the database. I believe this will bring a significant performance improvement, especially in a big Icinga environment with numerous objects. If you don't see it that way, you can also revert it, and we'll discuss it further when I'm back. Performance-wise, there's still room for improvement though: when starting Icinga 2 with a lot of objects, Icinga Notifications still takes a while to actually receive a state change event, 1-3m after Icinga DB Web shows it in the history view (e.g. when you submit a process-check-result with OK state, the respective incident in Icinga Notifications won't be closed until 1-3m later). The commit that implements this change is b01e61c.
  • Previously, when Icinga Notifications rejected the process event request with an outdated rules version, you posted the current event in the background in a go s.Submit() manner, so that it could be re-processed by the worker and re-evaluated against the newly updated rules. However, since the worker's input channel is buffered, this can mess up the order of the events when there are already other events waiting in the queue. So, I've changed this to directly re-evaluate the event using a magic but simple goto statement (see the second sketch below). The commit that implements this change is 55753b5.
  • Finally, I've dropped the rulesMutex mutex that was used to protect the rules map, since it is no longer needed. Maybe you've planned to parallelize the event submission at some point, but at the moment there's no concurrent access to it, so I've dropped it entirely with 3549849 for the time being.
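
For illustration, here's a minimal sketch of such a name lookup via HGET, assuming the Icinga DB Redis schema where config objects live in hashes like icinga:host, keyed by object ID and storing their attributes as JSON; the function name and field layout are assumptions, not the actual implementation:

```go
package main

import (
	"context"
	"encoding/json"

	"github.com/redis/go-redis/v9"
)

// fetchHostName looks up a host's name directly from the Redis config hash,
// avoiding an SQL round trip per history entry.
func fetchHostName(ctx context.Context, rdb *redis.Client, hostID string) (string, error) {
	// HGET icinga:host <id> returns the JSON-serialized host attributes.
	data, err := rdb.HGet(ctx, "icinga:host", hostID).Result()
	if err != nil {
		return "", err
	}

	var attrs struct {
		Name string `json:"name"`
	}
	if err := json.Unmarshal([]byte(data), &attrs); err != nil {
		return "", err
	}
	return attrs.Name, nil
}
```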

Also, all the individual commits are self-contained and include relevant commit messages with some details about the changes, so you can always refer to them for the justifications.
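
And a minimal sketch of the goto-based re-evaluation from the list above, with assumed names (submit, refreshRules, and errRulesOutdated are hypothetical): when the API rejects an event due to an outdated rules version, the rules are refreshed and the very same event is evaluated again in place, so queued events keep their order:

```go
// processEvent submits a single event; on a rules-version mismatch it
// refreshes the cached rules and retries the same event in place.
func (client *Client) processEvent(ctx context.Context, ev *event.Event) error {
	var err error

retry:
	err = client.submit(ctx, ev) // hypothetical: POST to the process-event endpoint
	if errors.Is(err, errRulesOutdated) { // hypothetical sentinel error
		if err := client.refreshRules(ctx); err != nil { // hypothetical
			return err
		}
		goto retry
	}
	return err
}
```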

Footnotes

  1. https://github.com/Icinga/icinga-go-library/pull/145

@julianbrost (Contributor) left a comment

Apart from the inline comments, there's also a more general question that still needs consideration: error handling, in the sense of what happens if submitting the event to Icinga Notifications doesn't work. With the current implementation, it is just discarded and thereby lost. That would be solved if the submission were synchronous in the history pipeline (sketched below), just as writing the history to the SQL database currently is. In that case, the event would only be deleted from Redis after it was both written to the database and sent to Icinga Notifications (on the other hand, if Icinga Notifications is unavailable, this would probably prevent history from being written).

This would also solve a related issue currently present in the implementation: once it's written to the Go channel, it's considered done. If Icinga DB is stopped, there may still be events in flight, and there seems to be nothing that would wait for them. So unless I'm missing something, a clean restart of the Icinga DB process might result in lost events.
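
A minimal sketch of the synchronous variant suggested above, with assumed types and names: the stage only forwards an entity downstream (and thereby towards deletion from Redis) once it was successfully handed to Icinga Notifications:

```go
// notificationsStage is a hypothetical history pipeline stage: entities flow
// from in to out, and each one is only forwarded after the corresponding
// event was accepted by Icinga Notifications, so a failed submission stalls
// the pipeline instead of silently dropping the event.
func notificationsStage(ctx context.Context, client *Client, in <-chan database.Entity, out chan<- database.Entity) error {
	defer close(out)

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case entity, ok := <-in:
			if !ok {
				return nil
			}
			if err := client.SubmitEvent(ctx, entity); err != nil { // hypothetical method
				return err
			}
			select {
			case out <- entity:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
}
```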

@oxzi oxzi force-pushed the icinga-notifications-init branch 4 times, most recently from 6dd9d96 to a199863 on September 9, 2025 10:10
@oxzi oxzi requested a review from julianbrost September 9, 2025 13:02
oxzi added a commit to Icinga/icinga2 that referenced this pull request Sep 25, 2025
@julianbrost (Contributor) left a comment

General question on something I didn't think about before: the history is passed over multiple streams that are processed independently. What consideration did you put into this? At first glance, this sounds like a nasty source of race conditions to me if things happen out of order.

Comment on lines 238 to 286

```go
case "downtime_end":
	if h.HasBeenCancelled.Valid && h.HasBeenCancelled.Bool {
		ev.Type = event.TypeDowntimeRemoved
		ev.Message = "Downtime was cancelled"

		if h.CancelledBy.Valid {
			ev.Username = h.CancelledBy.String
		}
	} else {
		ev.Type = event.TypeDowntimeEnd
		ev.Message = "Downtime expired"
	}
```
Contributor left a comment

Where's the unmute happening? And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?

Contributor left a comment

Where's the unmute happening?

Looks like you've added an ev.Mute = types.MakeBool(false) in case "downtime_end" to address this.

However, buildFlappingHistoryEvent() has the same issue:

```go
// buildFlappingHistoryEvent from a flapping history entry.
func (client *Client) buildFlappingHistoryEvent(ctx context.Context, h *v1history.FlappingHistory) (*event.Event, error) {
	ev, _, err := client.buildCommonEvent(ctx, h.HostId, h.ServiceId)
	if err != nil {
		return nil, errors.Wrapf(err, "cannot build event for %q,%q", h.HostId, h.ServiceId)
	}

	if h.PercentStateChangeEnd.Valid {
		ev.Type = event.TypeFlappingEnd
		ev.Message = fmt.Sprintf(
			"Checkable stopped flapping (Current flapping value %.2f%% < low threshold %.2f%%)",
			h.PercentStateChangeEnd.Float64, h.FlappingThresholdLow)
	} else if h.PercentStateChangeStart.Valid {
		ev.Type = event.TypeFlappingStart
		ev.Message = fmt.Sprintf(
			"Checkable started flapping (Current flapping value %.2f%% > high threshold %.2f%%)",
			h.PercentStateChangeStart.Float64, h.FlappingThresholdHigh)
		ev.Mute = types.MakeBool(true)
		ev.MuteReason = "Checkable is flapping"
	} else {
		return nil, errors.New("flapping history entry has neither percent_state_change_start nor percent_state_change_end")
	}

	return ev, nil
}
```

And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?

That part seems unaddressed though. That's a more fundamental issue and I don't see anything in the implementation that would handle nested muting by downtimes and flapping in any way (nor do I see an easy way to add it). Am I missing something here? (Or is this basically a limitation of getting the events this way?)

@oxzi (Member, Author) left a comment

However, buildFlappingHistoryEvent() has the same issue:

Yes, I have added an unmute there as well.

Furthermore, the current acknowledgment handling has no muting or unmuting at all, while there was conditional muting/unmuting in the old implementation. So far, I have just always added it, but this brings us straight to the next topic.

And is there any replacement for the old isMuted() function that checks if the checkable is even to be unmuted?

That part seems unaddressed though. That's a more fundamental issue and I don't see anything in the implementation that would handle nested muting by downtimes and flapping in any way (nor do I see an easy way to add it). Am I missing something here? (Or is this basically a limitation of getting the events this way?)

I am a bit unsure how to handle this. With this "stateless" pipeline approach, I don't see an easy way to address it. In theory, I could somehow dig most of the information out of Redis, but this wouldn't be beautiful and would potentially be expensive.

@oxzi oxzi force-pushed the icinga-notifications-init branch from ed7b0d1 to 0ab4f90 on October 21, 2025 14:40
@oxzi (Member, Author) commented Oct 22, 2025

Note: After talking with @lippserd, we realized that the custom vars are required for permission filtering in Web. I will check how to add the custom vars from Icinga DB in a "quick fix" manner.

@oxzi oxzi force-pushed the icinga-notifications-init branch from 48248d9 to 214bfe2 on October 22, 2025 16:15
@nilmerg (Member) commented Oct 24, 2025

Icinga/icingadb-web#1289 now prepares this as a filter:

```json
{
  "version": 1,
  "config": {
    "type": "all",
    "filter": "host.name=docker-master"
  },
  "queries": {
    "host": {
      "query": "SELECT (1) FROM host WHERE ((host.id = ?) AND (host.environment_id = ?)) AND (host.name = ?) LIMIT 1",
      "parameters": [":host_id", ":environment_id", "docker-master"]
    },
    "service": {
      "query": "SELECT (1) FROM service LEFT JOIN host service_host ON service_host.id = service.host_id WHERE ((service.id = ?) AND (service.environment_id = ?)) AND (service_host.name = ?) LIMIT 1",
      "parameters": [":service_id", ":environment_id", "docker-master"]
    }
  }
}
```

config.type may also be just "host" or "service"; queries then only contains the single respective key.
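
For illustration, a minimal sketch of how such a prepared query could be evaluated; the type and function names are assumptions, not the actual Client.evaluateRulesForObject implementation. Parameters starting with ":" are resolved from the object's columns, everything else (like "docker-master" above) is passed through as a literal:

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"strings"

	"github.com/jmoiron/sqlx"
)

// RuleQuery mirrors one entry of the "queries" object shown above.
type RuleQuery struct {
	Query      string `json:"query"`
	Parameters []any  `json:"parameters"`
}

// matchesObject runs the rule's SELECT (1) ... LIMIT 1 query; a returned row
// means the object matches the rule's filter.
func matchesObject(ctx context.Context, db *sqlx.DB, q RuleQuery, columns map[string]any) (bool, error) {
	args := make([]any, 0, len(q.Parameters))
	for _, p := range q.Parameters {
		// ":host_id" and friends are placeholders for the object's columns.
		if s, ok := p.(string); ok && strings.HasPrefix(s, ":") {
			args = append(args, columns[strings.TrimPrefix(s, ":")])
			continue
		}
		args = append(args, p) // literal filter value, e.g. "docker-master"
	}

	var one int
	err := db.QueryRowxContext(ctx, q.Query, args...).Scan(&one)
	if errors.Is(err, sql.ErrNoRows) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```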

oxzi added a commit that referenced this pull request Oct 27, 2025
The rules are no longer just plain SQL queries, but now have their own JSON format, introduced by Icinga DB Web. This format is now supported by Client.evaluateRulesForObject.

- Icinga/icingadb-web#1289
- #998 (comment)
@nilmerg (Member) commented Oct 27, 2025

Note: After talking with @lippserd, we realized that the custom vars are required for permission filtering in Web. I will check how to add the custom vars from Icinga DB in a "quick fix" manner.

I thought about this again and noticed another limitation: transferring custom variables 1:1 might be enough if users want to match hosts with host vars and services with service vars, but the following is not possible:

  1. Matching services with host vars.
     This is a common thing: declaring a host variable and using it in apply rules to generate services makes total sense to also use as a restriction. However, with the 1:1 mapping, services only get their own variables assigned, not those of their host.
  2. Matching only hosts or services where their services' variables or their host's variables (respectively) match.
     Consider host.vars.os=Linux. Using this to filter services only matches services whose host has set os=Linux.
     Also: service.vars.port=443. Using this to filter hosts only matches hosts that have a service with port=443 set.
     With the 1:1 mapping and no way to differentiate the object type, object.vars.os=Linux also matches services that are not related to a host with this variable but have their own variable set, due to the first case.

It's not possible to re-use existing restrictions by simply copying them over, either because of missing data (Case 1) or because of false positives (Case 2).

Case 1 can be fixed by just transferring all host vars to all of their services as well. We are already dealing with huge amounts of duplicated rows, though…
Case 2 can only be partially fixed by introducing an extra ID tag type and then doing object.tag.type=host&object.vars.os=Linux. This is even more important when considering the fix for Case 1. Only partially, because there is no way to express relationship checks; we cannot support matching objects based on another object's configuration. (We might, by introducing a dynamic join allowing objects to be joined where a subset of ID tags match, but I don't want to see the query plan for this monstrosity if multiple extra tags are part of a restriction.)

But still, there is no simple way to re-use existing restrictions. At least this is somewhat more usable than just a 1:1 mapping.

oxzi and others added 25 commits November 17, 2025 09:20
Refactor telemetry.Stats to allow custom names. This enables dynamic callback names for the Redis history sync, used by Icinga Notifications.
- Don't validate the notifications config in a background Goroutine.
- Clip the pipeline slice to avoid reusing its capacity twice.
- Rework the notifications Client.buildCommonEvent and dependent methods.
- Resubmit events after updating rules in one go.
- Simplify Client.fetchHostServiceName based on Julian's suggestion.

Co-Authored-By: Julian Brost <julian.brost@icinga.com>
After reintroducing Event.ExtraTags in the IGL and Icinga Notifications, Icinga DB populates events with their custom variables.

At the moment, the required customvars are fetched from Redis for each event. Due to the Redis schema, at least one HGETALL with manual filtering is required. This might be a good candidate for further caching, and cache invalidation.
The rules are no longer just plain SQL queries, but now have their own JSON format, introduced by Icinga DB Web. This format is now supported by Client.evaluateRulesForObject.

- Icinga/icingadb-web#1289
- #998 (comment)
The StreamSorter was added to history, allowing messages from multiple Redis streams to be collected, sorted based on the timestamp in the Stream ID, and ejected back.

This is used for the callback stage required by Icinga Notifications; in the Notifications context, an ordered stream is required.

Despite my best intentions, it felt like I had created an Erlang.
Rework the prior custom variable fetching code to no longer fetch everything in a looping fashion from Redis, but to send SQL queries for the custom variables instead.

In addition, service objects now contain both the service and host custom variables, prefixed by "host.vars." or "service.vars.".
Refactor multiple variables into a common struct to ease handling.
- Ignore the "config" part of the JSON struct which is only relevant for
  Icinga DB Web.
- Remove unnecessary string conversions.
- Small code changes/improvements.
Effectively reverting cf4bd92 and
passing a pointer to the relevant com.Counter to the history sync.
With parseRedisStreamId being on the hot path, the quite simple regular expression was exchanged for an even simpler string split. However, as network operations precede and follow this, the benefit of this optimization might be questionable.
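
A minimal sketch of the split-based parsing, assuming a Stream ID always has the "<ms>-<seq>" shape (e.g. "1526919030474-55"); the actual signature may differ:

```go
// parseRedisStreamId splits a Redis Stream ID into its millisecond timestamp
// and sequence number without a regular expression.
func parseRedisStreamId(id string) (ms, seq uint64, err error) {
	msPart, seqPart, ok := strings.Cut(id, "-")
	if !ok {
		return 0, 0, fmt.Errorf("invalid Redis Stream ID %q", id)
	}
	if ms, err = strconv.ParseUint(msPart, 10, 64); err != nil {
		return 0, 0, fmt.Errorf("invalid timestamp in Stream ID %q: %w", id, err)
	}
	if seq, err = strconv.ParseUint(seqPart, 10, 64); err != nil {
		return 0, 0, fmt.Errorf("invalid sequence in Stream ID %q: %w", id, err)
	}
	return ms, seq, nil
}
```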
- Store the streamSorterSubmission submission time as a time.Time instead of
  a nanosecond timestamp, comparing via time.Time's monotonic clock.
- Replace the time-based buckets in StreamSorter.submissionWorker with a heap
  to be pushed and popped; only submissions of a certain age are forwarded
  (see the sketch below). Reduces complexity quite a bit.
- Reduce the complexity of StreamSorter.queueWorker by getting rid of
  unnecessary channel signals, checking for new queue events to process at
  the loop start.
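
A minimal sketch of the heap-based ordering with assumed types: submissions form a min-heap ordered by their Stream ID timestamp, and only entries older than a grace period are popped, giving slower streams time to deliver earlier messages:

```go
package main

import "container/heap"

// submission is a hypothetical, stripped-down stand-in for
// streamSorterSubmission; payload and output channel are omitted.
type submission struct {
	streamMs uint64 // millisecond part of the Redis Stream ID
}

// submissionHeap is a min-heap ordered by the Stream ID timestamp,
// implementing heap.Interface from container/heap.
type submissionHeap []*submission

func (h submissionHeap) Len() int           { return len(h) }
func (h submissionHeap) Less(i, j int) bool { return h[i].streamMs < h[j].streamMs }
func (h submissionHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *submissionHeap) Push(x any)        { *h = append(*h, x.(*submission)) }
func (h *submissionHeap) Pop() any {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// drainReady forwards, in timestamp order, all submissions that are older
// than the grace period.
func drainReady(h *submissionHeap, nowMs, graceMs uint64, forward func(*submission)) {
	for h.Len() > 0 && (*h)[0].streamMs+graceMs <= nowMs {
		forward(heap.Pop(h).(*submission))
	}
}
```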
Introduce the StreamSorter.CloseOutput method to remove all submissions
for a certain output channel from both workers before closing the
channel.

The motivation behind this change is to have a single point where the
output channel is closed while no submissions are being sent into an
already closed channel.
Next to minor changes, the custom variables are now fetched from customvar_flat, i.e. in their flattened format.
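
As an illustration, a hedged sketch of what such a query could look like against the customvar_flat and host_customvar tables of the Icinga DB schema; the exact query used by the implementation may differ:

```go
// hostVarsQuery fetches a host's flattened custom variables, to be exposed
// to Icinga Notifications with a "host.vars." prefix (services analogously
// via service_customvar).
const hostVarsQuery = `
SELECT customvar_flat.flatname, customvar_flat.flatvalue
FROM customvar_flat
INNER JOIN host_customvar ON host_customvar.customvar_id = customvar_flat.customvar_id
WHERE host_customvar.host_id = ?`
```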
Next to some other small cleanups, the streamSorterSubmissions slice
type now references pointers.
…ementation

This commit is pretty much an overhaul of the implementation to allow for a more straightforward way to close the output channel. The main changes are:

- StreamSorter now provides a method PipelineFunc that can directly be used in
  a history sync pipeline. This allows StreamSorter to handle the in + out
  stream pair internally, so that it closes out after in was closed and all
  messages from it were passed to out.
- The two worker goroutines were combined into a single one and the secondary
  queue was removed. All pending messages remain in the heap and are only
  removed when they are about to be passed to the callback.
- The worker now handles all operations (send and close) on the output stream.
After Julian reworked big parts of the StreamSorter for the better, I went over the code multiple times and added a few comments for parts I had to read twice.

Within the tests, there might be a data race when zaptest is used after the test's context is done. Since a few log messages could potentially occur after the test's end, a guard was added to ensure no verbose log messages are produced once the context is done.
The parameters can not only be strings, but anything to PHP's liking. In one example, an integer was observed. Since Parameters is converted to an []any later anyway, this is no real change in behavior.
The whole StreamSorter logic is only required for Icinga Notifications. Thus, the implementation was moved from the history package to the notifications package, removing some unnecessary generalizations along the way. This results in big changes in the notifications package, while other modules are mostly unaffected.
Change the message for TypeAcknowledgementCleared events to a more
obvious one.
Populate the Event's Mute field for muting and unmuting for flapping
events and acknowledgements.
Allow configurable timeouts for the StreamSorter, so they can be set to a fraction of their defaults in the tests. Now the tests finish in three seconds instead of three minutes.

While doing so, another race condition with the test logging was unveiled. Since this race results from a closing test context and test logger, there was not much to do, and I decided to just drop the logging message, which was only used for tests anyway.
Remove the generic information from 01-About.md and add a warning to
03-Configuration.md that this feature might not be stable.
Drop the "-source" suffix from the configuration option. Furhtermore,
with the latest IGL release 0.8.1[0], the "api-base-url" is renamed to
just "url".

[0]: Icinga/icinga-go-library#168
@oxzi oxzi force-pushed the icinga-notifications-init branch from fea8170 to 327862c on November 17, 2025 08:22
@oxzi oxzi marked this pull request as ready for review November 17, 2025 08:22
@julianbrost julianbrost merged commit ff1c49e into main Nov 17, 2025
31 checks passed
@julianbrost julianbrost deleted the icinga-notifications-init branch November 17, 2025 12:46