[WIP] Added health checks section for virtual routers (#89)
* [WIP] updates to service monitoring script
* Added health checks section
* Some formatting
* Some formatting
* Some more formatting
* Missing space for numbering
* Update systemvm.rst
Co-authored-by: Andrija Panic <45762285+andrijapanicsb@users.noreply.github.com>
Co-authored-by: Anurag Awasthi <anurag.awasthi@shapeblue.com>
#. Head to the website of your favorite trusted Certificate
244
+
#. Head to the website of your favourite trusted Certificate
245
245
Authority, purchase an SSL certificate, and submit the CSR. You
246
246
should receive a valid certificate in return
247
247
@@ -304,11 +304,11 @@ If you still have problems and folowing errors in management.log while destroyin
304
304
- Unable to build keystore for CPVMCertificate due to CertificateException
305
305
- Cold not find and construct a valid SSL certificate
306
306
307
-
that means that still some of the Root/intermediate/server certificates or the key is not in a good format, or incorrectly encoded or multiply Root CA/Intemediate CA present in database by mistake.
307
+
that means that some of the Root/intermediate/server certificates or the key are still not in a good format, are incorrectly encoded, or multiple Root CA/Intermediate CA entries are present in the database by mistake.
308
308
309
309
Another way to renew certificates (Root, Intermediates, Server certificates and key) - although not recommended
310
310
unless you feel comfortable - is to directly edit the database,
311
-
while still respect the main requirement that the private key is PKCS8 encoded, while Root CA, Intemediate and Server certificates
311
+
while still respecting the main requirement that the private key is PKCS8 encoded, while Root CA, Intermediate and Server certificates
312
312
are still in default PEM format (no URL encoding needed here).
313
313
After editing the database, please restart management server, and destroy SSVM and CPVM after that,
314
314
so the new SSVM and CPVM with new certificates are created.
@@ -411,7 +411,7 @@ Service Monitoring Tool for Virtual Router
411
411
Various services running on the CloudStack virtual routers can be
412
412
monitored by using a Service Monitoring tool. The tool ensures that
413
413
services are successfully running until CloudStack deliberately disables
414
-
them. If a service goes down, the tool automatically restarts the
414
+
them. If a service goes down, the tool automatically attempts to restart the
415
415
service, and if that does not help bring the service up, an alert as
416
416
well as an event is generated indicating the failure. A new global
417
417
parameter, ``network.router.enableservicemonitoring``, has been
@@ -430,7 +430,7 @@ an unexpected reason. For example:
430
430
.. note::
431
431
Only those services with daemons are monitored. The services that are
432
432
failed due to errors in the service/daemon configuration file cannot
433
-
be restarted by the Monitoring tool.
433
+
be restarted by the Monitoring tool. VPC networks are supported (as of CloudStack 4.14).
434
434
435
435
The following services are monitored in a VR:
436
436
@@ -453,11 +453,212 @@ The following networks are supported:
453
453
This feature is supported on the following hypervisors: XenServer,
454
454
VMware, and KVM.
455
455
456
-
Log file /var/log/routerServiceMonitor.log contains the actions undertaken/attempted by the service monitoring script (i.e. trying to restart a stopped service).
456
+
Log file /var/log/routerServiceMonitor.log contains the actions undertaken/attempted
457
+
by the service monitoring script (i.e. trying to restart a stopped service).
457
458
458
-
As of CloudStack 4.14, the internval at which the service monitoring script runs is no more hardcoded to 3 minutes, but is instead controlled via global setting router.health.checks.basic.interval.
459
+
As of CloudStack 4.14, the interval at which the service monitoring script runs
460
+
is no longer hardcoded to 3 minutes, but is instead controlled via
461
+
the global setting ``router.health.checks.basic.interval``.
459
462
460
463
464
+
Health checks for Virtual Router
465
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
466
+
467
+
In addition to monitoring services, as of 4.14 CloudStack adds a framework
468
+
for more extensive health checks. The health checks are split into two
469
+
categories - basic and advanced. The two categories have their own admin
470
+
definable intervals. The split is made this way as the advanced health checks
471
+
are considerably more expensive. The health checks will be available on-demand
472
+
via API as well as scheduled.
473
+
474
+
The following tests are covered:

- Basic connectivity from the management server
  to the virtual router
476
+
477
+
- Basic connectivity from the virtual router to its interfaces' gateways
478
+
479
+
- Free disk space on virtual router's disk
480
+
481
+
- CPU and memory usage
482
+
483
+
- Basic VR Sanity checks:
484
+
485
+
#. Ssh/dnsmasq/haproxy/httpd service running
486
+
487
+
- Advanced VR Sanity checks:
488
+
489
+
#. DHCP/DNS configuration matches mgmt server DB
490
+
491
+
#. IPtables rules match management server records
492
+
493
+
#. HAproxy config matches mgmt server DB records
494
+
495
+
#. VR Version against current version
496
+
497
+
498
+
This happens in the following steps:
499
+
500
+
1. Management server periodically pushes data to each running virtual router
501
+
including schedule intervals, tests to skip, some configuration for LB, VMs,
502
+
Gateways, etc.
503
+
504
+
2. Basic and advanced tests are scheduled as per the intervals in the data sent
505
+
by the management server. Each run of checks populates its results and saves them
506
+
within the router at '/root/basic_monitor_results.json' and
507
+
'/root/advance_monitor_results.json'. Each run of checks also keeps
508
+
track of the start time, end time, and duration of test run for better
509
+
understanding.
510
+
511
+
3. Each test is also available on demand via the 'getRouterHealthCheckResults'
512
+
API added with the patch. The API can be executed from CLI and UI. Performing
513
+
fresh checks is expensive and causes the management server to do the following:
514
+
515
+
a. Refresh the data from Management server records on the router for
516
+
verification (repeat of step 1),
517
+
518
+
b. Run all the checks of both basic and advanced type,
519
+
520
+
c. Fetch the result of the health check from router to be sent back in response.
521
+
522
+
4. The patch also supports custom health checks with custom systemVM templates.
523
+
This is achieved as follows:
524
+
525
+
a. Each executable script placed in '/root/health_checks/' is considered an
526
+
independent health check and is executed on each scheduled or on demand health check run.
527
+
528
+
b. The health check script can be written in any language but must be executable (use 'chmod a+x')
529
+
within the '/root/health_checks/' directory. The placed script must do the following:
530
+
531
+
#. Accept a command line parameter for check type (basic or advanced) - this
532
+
parameter is sent by the internal cron job in the VR (/etc/cron.d/process)
533
+
534
+
#. Proceed and perform checks as per the check type - basic or advanced
535
+
536
+
#. In order to be recognized as a health check and displayed in the list of health
537
+
checks results, it must print some message to STDOUT which is passed back as message
538
+
to management server - if the script doesn’t return anything on its STDOUT, it
539
+
will not be registered as a health check/displayed in the list of the health check results
540
+
541
+
#. Exit with a status of 0 if the check was successful and with a status of 1 if the
542
+
check has failed
543
+
544
+
.. code:: bash
545
+
546
+
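   #!/bin/bash
   # $1 is the check type passed in by the VR cron job: basic or advanced
   if [ "$1" == "advanced" ]; then
       # do advanced checks and print any message to STDOUT
       echo "advanced check message"
   elif [ "$1" == "basic" ]; then
       # do basic checks and print any message to STDOUT
       echo "basic check message"
   fi
   # exit 0 if the check passed, exit 1 if it failed
   exit 0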
548
+
549
+
#. For example, if the script is intended to be a basic check, it must check
550
+
for the presence of the 'basic' as the first parameter sent to it, and execute the
551
+
wanted commands and print some output to STDOUT; otherwise if it receives 'advanced'
552
+
as the first parameter, it should not execute any commands/logic nor print anything to STDOUT
553
+
554
+
5. There are 9 health check scripts in the default systemvm template in the '/root/health_checks/'
555
+
folder. These implement the health checks listed above.
556
+
557
+
6. The management server will connect periodically to each virtual router to confirm that the
558
+
checks are running as scheduled, and retrieve the results of those checks. Any failing checks
559
+
present in ``router.health.checks.failures.to.recreate.vr`` will cause the VR to be recreated.
560
+
On each check, the management server persists only the last executed check results in its database.
561
+
562
+
7. UI parses the returned health check results and shows the router 'Health Check'
563
+
column as 'Failed' if there are health check failures of any type, and as 'Passed' otherwise.
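
The contract for a custom health check described in step 4 can be illustrated with a minimal Python sketch. It assumes a hypothetical basic check on root disk usage; the ``MAX_USED_PERCENT`` threshold and the ``evaluate`` and ``main`` helper names are illustrative only and are not part of CloudStack. The script reacts only to the 'basic' check type, prints a message to STDOUT when it runs, and reports 0 on success or 1 on failure.

```python
#!/usr/bin/env python3
"""Sketch of a custom *basic* health check: used space on the root disk."""
import shutil
import sys

# Hypothetical threshold chosen for illustration, not a CloudStack default.
MAX_USED_PERCENT = 90.0


def evaluate(check_type, used_percent):
    """Return (exit_status, message) for the given check type."""
    if check_type != "basic":
        # Not our check type: stay silent so nothing is registered for this run.
        return 0, ""
    if used_percent > MAX_USED_PERCENT:
        return 1, "Disk usage %.1f%% exceeds %.1f%%" % (used_percent, MAX_USED_PERCENT)
    return 0, "Disk usage %.1f%% is within limits" % used_percent


def main(argv):
    """Run the check for the type passed as the first command line parameter."""
    check_type = argv[1] if len(argv) > 1 else ""
    usage = shutil.disk_usage("/")
    used_percent = 100.0 * usage.used / usage.total
    status, message = evaluate(check_type, used_percent)
    if message:
        print(message)  # STDOUT is passed back to the management server
    return status

# When deployed as a script, finish the file with: sys.exit(main(sys.argv))
```

Keeping the decision logic in ``evaluate`` separate from the disk lookup makes the pass/fail rule easy to exercise without touching a real router.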
564
+
565
+
The following global configs have been added for configuring health checks:
566
+
567
+
- ``router.health.checks.enabled`` - If true, router health checks are allowed
568
+
to be executed and read. If false, all scheduled checks and API calls for on
569
+
demand checks are disabled. Default is true.
570
+
571
+
- ``router.health.checks.basic.interval`` - Interval in minutes at which basic
572
+
router health checks are performed. If set to 0, no tests are scheduled. Default
573
+
is 3 mins as per the pre 4.14 monitor services.
574
+
575
+
- ``router.health.checks.advanced.interval`` - Interval in minutes at which
576
+
advanced router health checks are performed. If set to 0, no tests are scheduled.
577
+
Default value is 10 minutes.
578
+
579
+
- ``router.health.checks.config.refresh.interval`` - Interval in minutes at which
580
+
router health checks config - such as scheduling intervals, excluded checks, etc. -
581
+
is updated on virtual routers by the management server. This value should be
582
+
sufficiently high (like 2x) from the router.health.checks.basic.interval and
583
+
router.health.checks.advanced.interval so that there is time between new results
584
+
generation for passed data. Default is 10 mins.
585
+
586
+
- ``router.health.checks.results.fetch.interval`` - Interval in minutes at which
587
+
router health checks results are fetched by management server. On each result fetch,
588
+
management server evaluates need to recreate VR as per configuration of
589
+
'router.health.checks.failures.to.recreate.vr'. This value should be sufficiently
590
+
high (like 2x) from the 'router.health.checks.basic.interval' and
591
+
'router.health.checks.advanced.interval' so that there is time between new
592
+
results generation and fetch.
593
+
594
+
- ``router.health.checks.failures.to.recreate.vr`` - Health checks failures defined
595
+
by this config are the checks that should cause router recreation. If empty, the
596
+
recreate is not attempted for any health check failure. Possible values are comma
597
+
separated script names from systemvm’s /root/health_checks/ (namely - cpu_usage_check.py,