151 changes: 141 additions & 10 deletions docs/setup_installation/admin/ha-dr/dr.md
Expand Up @@ -121,15 +121,22 @@ For S3 object storage, you can also configure a bucket lifecycle policy to expir

## Restore

Hopsworks supports two restore modes:

- **New cluster restore**: Install a fresh cluster and restore data from a backup during installation.
- **In-place restore**: Restore data onto an existing running cluster via `helm upgrade`.

!!! Note
    Use the exact Hopsworks version that was used to create the backup.

### New Cluster Restore

The new cluster restore process has two phases:

- Restore Kubernetes objects required for the cluster restore.
- Install the cluster with Helm using the correct backup IDs.

#### Restore Kubernetes objects

Restore the Kubernetes objects that were backed up using Velero.

Expand Down Expand Up @@ -202,19 +209,18 @@ done

# Restores the latest backup of the schedule; to restore a specific backup, set spec.backupName instead of spec.scheduleName
echo "=== Creating Velero Restore object for k8s-backups-main ==="
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-main
  namespace: velero
spec:
  scheduleName: k8s-backups-main
EOF

echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-main -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
  echo "Still waiting..."; sleep 5;
done

Expand All @@ -224,14 +230,14 @@ kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-users-resources
  namespace: velero
spec:
  scheduleName: k8s-backups-users-resources
EOF

echo "=== Waiting for Velero restore to finish ==="
until [ "$(kubectl get restore k8s-backups-users-resources -n velero -o jsonpath='{.status.phase}' 2>/dev/null)" = "Completed" ]; do
  echo "Still waiting..."; sleep 5;
done
```
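The `until` loops above keep polling for as long as the phase is anything other than `Completed`, so they never return if a restore ends in a failure phase. The following is a minimal sketch of a reusable wait helper that also stops on failure phases and on a timeout. The function name `wait_for_velero_restore`, the 1800-second default timeout, and the return codes are illustrative assumptions, not part of the Hopsworks tooling.

```bash
# Sketch: wait for a Velero Restore object to reach a terminal phase.
# Assumes kubectl is configured for the target cluster.
# Returns 0 on Completed, 1 on a failure phase, 2 on timeout.
wait_for_velero_restore() {
  local name="$1"
  local timeout="${2:-1800}"  # seconds; illustrative default
  local interval="${3:-5}"    # seconds between polls
  local elapsed=0
  local phase

  while [ "$elapsed" -le "$timeout" ]; do
    # An empty phase means the object does not exist yet or has no status.
    phase="$(kubectl get restore "$name" -n velero -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    case "$phase" in
      Completed)
        echo "Restore $name completed"
        return 0
        ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "Restore $name ended in phase $phase" >&2
        return 1
        ;;
    esac
    echo "Still waiting (phase: ${phase:-unknown})..."
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "Timed out after ${timeout}s waiting for restore $name" >&2
  return 2
}
```

For example, `wait_for_velero_restore k8s-backups-main` could stand in for the first loop, and `wait_for_velero_restore k8s-backups-users-resources` for the second.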
Expand All @@ -248,7 +254,7 @@ kubectl get configmap opensearch-backups-metadata -n hopsworks -o json \
| sort -nr
```

#### Restore on Cluster installation

To restore a cluster during installation, configure the backup ID in the values YAML file:

Expand All @@ -262,7 +268,7 @@ global:
backupId: "254811200"
```

##### Customizations

!!! Warning
    Even if you override the backup IDs for RonDB and OpenSearch, you must still set `.global._hopsworks.restoreFromBackup.backupId` to ensure HopsFS is restored.
Expand Down Expand Up @@ -327,3 +333,128 @@ olk:
payload:
indices: "-myindex"
```

### In-Place Restore

In-place restore restores data onto an existing running cluster using `helm upgrade`. Unlike a new cluster restore, it does not require provisioning a fresh cluster: the existing stateful services are shut down, wiped if necessary, and restored from the backup.

!!! Warning
    In-place restore **replaces all existing data** in the cluster with the backup data. Any data written after the backup was taken will be lost.

!!! Info
    After a fresh install from backup (new cluster restore), in-place restores can only be performed using backups taken **after** that fresh install, because the cluster certificates are regenerated during installation. To restore from a backup that was taken **before** the fresh install, you must perform another new cluster restore from that backup instead of an in-place restore.

#### In-place restore prerequisites

- A running Hopsworks cluster deployed via Helm.
- A previously created backup with a known backup ID.
- Object storage configured and accessible with the backup data.
- Velero installed and configured as described in the [prerequisites](#prerequisites).
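
The first and last prerequisites can be checked from the command line before starting. This is a rough sketch only: the function name `preflight_inplace_restore` is hypothetical, and the defaults (release `hopsworks`, namespace `hopsworks`, backup storage location `hopsworks-bsl` in the `velero` namespace) are assumptions taken from the commands used elsewhere in this guide. Adjust them if your installation differs.

```bash
# Sketch: preflight checks before an in-place restore.
# Assumes helm and kubectl are configured for the target cluster.
preflight_inplace_restore() {
  local release="${1:-hopsworks}"
  local namespace="${2:-hopsworks}"
  local bsl="${3:-hopsworks-bsl}"
  local phase

  # The cluster must already be deployed via Helm.
  if ! helm status "$release" -n "$namespace" >/dev/null 2>&1; then
    echo "Helm release $release not found in namespace $namespace" >&2
    return 1
  fi

  # Velero reports whether it can reach the object storage behind the location.
  phase="$(kubectl get backupstoragelocation "$bsl" -n velero -o jsonpath='{.status.phase}' 2>/dev/null || true)"
  if [ "$phase" != "Available" ]; then
    echo "Velero backup storage location $bsl is not Available (phase: ${phase:-unknown})" >&2
    return 1
  fi

  echo "Preflight checks passed"
}
```

The check does not verify that a particular backup ID exists; it only confirms that the release is present and that Velero can reach the backup storage.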

#### Identify the backup ID

Get the backup ID from the **Cluster Settings > Backup** tab or by using the following commands.

```bash
# RonDB backup IDs (newest first)
kubectl get configmap rondb-backups-metadata -n hopsworks -o json \
  | jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
  | sort -nr

# OpenSearch backup IDs (newest first)
kubectl get configmap opensearch-backups-metadata -n hopsworks -o json \
  | jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
  | sort -nr

# Velero backup IDs for the main schedule (newest first)
kubectl get backups -n velero -o json \
  | jq -r '[.items[] | select(.spec.storageLocation == "hopsworks-bsl" and .metadata.labels["velero.io/schedule-name"] == "k8s-backups-main" and .status.phase == "Completed")] | sort_by(.status.completionTimestamp) | reverse[] | .metadata.name'

# Velero backup IDs for the users schedule (newest first)
kubectl get backups -n velero -o json \
  | jq -r '[.items[] | select(.spec.storageLocation == "hopsworks-bsl" and .metadata.labels["velero.io/schedule-name"] == "k8s-backups-users-resources" and .status.phase == "Completed")] | sort_by(.status.completionTimestamp) | reverse[] | .metadata.name'
```
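
When scripting a restore, it can help to capture the newest ID in a variable instead of copying it by hand. The sketch below assumes the numeric IDs printed by the RonDB and OpenSearch commands above. Both function names are hypothetical helpers, and `latest_rondb_backup_id` additionally requires `kubectl` and `jq`.

```bash
# Sketch: pick the newest (largest) numeric backup ID from a list on stdin.
# Non-numeric lines are ignored.
newest_backup_id() {
  grep -E '^[0-9]+$' | sort -nr | head -n 1
}

# Sketch: newest successful RonDB backup ID, using the same query as above.
latest_rondb_backup_id() {
  kubectl get configmap rondb-backups-metadata -n hopsworks -o json \
    | jq -r '.data | to_entries[] | select(.value | fromjson | .state == "SUCCESS") | .key' \
    | newest_backup_id
}
```

For example, `BACKUP_ID="$(latest_rondb_backup_id)"` stores the newest successful RonDB backup ID. Check that the same ID is valid for the other services before using it as the global backup ID.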

#### Run the in-place restore

Configure the restore in the values file and run `helm upgrade`:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
      inPlace: true
      forceDataClear: true

# Optional: specify Velero backup IDs. If not set, the latest completed backup is used.
hopsworks:
  velero:
    restore:
      mainScheduleBackupId: "k8s-backups-main-20260213T153627Z"
      usersScheduleBackupId: "k8s-backups-users-resources-20260213T153627Z"
```

Then run:

```bash
helm upgrade hopsworks hopsworks/hopsworks --version <CHART_VERSION> \
  --namespace hopsworks \
  -f values.yaml \
  --timeout 1200s
```

You can also pass the restore flags directly on the command line:

```bash
helm upgrade hopsworks hopsworks/hopsworks --version <CHART_VERSION> \
  --namespace hopsworks \
  --set-string global._hopsworks.restoreFromBackup.backupId="254811200" \
  --set global._hopsworks.restoreFromBackup.inPlace=true \
  --set global._hopsworks.restoreFromBackup.forceDataClear=true \
  --set-string hopsworks.velero.restore.mainScheduleBackupId="k8s-backups-main-20260213T153627Z" \
  --set-string hopsworks.velero.restore.usersScheduleBackupId="k8s-backups-users-resources-20260213T153627Z" \
  --timeout 1200s
```

The required flags are:

| Parameter | Description |
| --------- | ----------- |
| `global._hopsworks.restoreFromBackup.backupId` | The backup ID to restore from. |
| `global._hopsworks.restoreFromBackup.inPlace` | Must be `true` to enable in-place restore mode. |
| `global._hopsworks.restoreFromBackup.forceDataClear` | Must be `true` to confirm that existing data will be replaced. This is a safety mechanism to prevent accidental data loss. |

The following flags are optional. If not set, the latest available Velero backup will be used:

| Parameter | Description |
| --------- | ----------- |
| `hopsworks.velero.restore.mainScheduleBackupId` | The Velero backup ID for the main schedule (`k8s-backups-main`). |
| `hopsworks.velero.restore.usersScheduleBackupId` | The Velero backup ID for the users schedule (`k8s-backups-users-resources`). |
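
The two tables can be turned into a small helper that always emits the three required flags and adds the optional Velero flags only when they are supplied. This is an illustrative sketch: `build_inplace_restore_args` is a hypothetical function name, and only the parameter paths come from the tables above.

```bash
# Sketch: print the helm flags for an in-place restore, one token per line.
# Usage: build_inplace_restore_args BACKUP_ID [MAIN_VELERO_ID] [USERS_VELERO_ID]
build_inplace_restore_args() {
  local backup_id="${1:-}"
  local main_id="${2:-}"
  local users_id="${3:-}"

  if [ -z "$backup_id" ]; then
    echo "usage: build_inplace_restore_args BACKUP_ID [MAIN_VELERO_ID] [USERS_VELERO_ID]" >&2
    return 1
  fi

  # Required flags: backup ID, in-place mode, and the data-clear confirmation.
  printf '%s\n' \
    "--set-string" "global._hopsworks.restoreFromBackup.backupId=${backup_id}" \
    "--set" "global._hopsworks.restoreFromBackup.inPlace=true" \
    "--set" "global._hopsworks.restoreFromBackup.forceDataClear=true"

  # Optional flags: omitted so that the latest Velero backup is used.
  if [ -n "$main_id" ]; then
    printf '%s\n' "--set-string" "hopsworks.velero.restore.mainScheduleBackupId=${main_id}"
  fi
  if [ -n "$users_id" ]; then
    printf '%s\n' "--set-string" "hopsworks.velero.restore.usersScheduleBackupId=${users_id}"
  fi
}
```

Because none of the values contain whitespace, the output can be passed to Helm through plain word splitting, for example `helm upgrade hopsworks hopsworks/hopsworks --version <CHART_VERSION> --namespace hopsworks $(build_inplace_restore_args 254811200) --timeout 1200s`.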

#### Re-running an in-place restore

In-place restore creates marker resources to prevent accidental re-runs. If you need to run the restore again with the same backup ID, delete the marker resources first:

```bash
# Delete the HopsFS restore job
kubectl delete job hopsfs-inplace-restore-<BACKUP_ID> -n hopsworks --ignore-not-found=true

# Delete the RonDB restore jobs
kubectl delete job restore-native-backup-<BACKUP_ID> -n hopsworks --ignore-not-found=true
kubectl delete job setup-mysqld-dont-remove-<BACKUP_ID> -n hopsworks --ignore-not-found=true

# Delete the OpenSearch restore job
kubectl delete job opensearch-restore-default-default-<BACKUP_ID> -n hopsworks --ignore-not-found=true

# Delete the Velero restore objects (use the exact backup name or schedule name)
kubectl delete restore.velero.io k8s-backups-main -n velero --ignore-not-found=true
kubectl delete restore.velero.io k8s-backups-users-resources -n velero --ignore-not-found=true
```
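
Since every marker job name embeds the backup ID, the job deletions can be parameterized. The sketch below assumes the job name patterns shown above. `marker_job_names` and `delete_restore_markers` are hypothetical helper names, and the Velero restore objects still need to be deleted separately because their names depend on the backup or schedule used.

```bash
# Sketch: list the marker job names for a given backup ID.
marker_job_names() {
  local backup_id="$1"
  printf '%s\n' \
    "hopsfs-inplace-restore-${backup_id}" \
    "restore-native-backup-${backup_id}" \
    "setup-mysqld-dont-remove-${backup_id}" \
    "opensearch-restore-default-default-${backup_id}"
}

# Sketch: delete all marker jobs for a backup ID in the hopsworks namespace.
delete_restore_markers() {
  local backup_id="${1:-}"
  local job

  if [ -z "$backup_id" ]; then
    echo "BACKUP_ID is required" >&2
    return 1
  fi

  # Job names contain no whitespace, so word splitting is safe here.
  for job in $(marker_job_names "$backup_id"); do
    kubectl delete job "$job" -n hopsworks --ignore-not-found=true
  done
}
```

Running `marker_job_names 254811200` first prints the names without deleting anything, which is a cheap way to confirm the ID before calling `delete_restore_markers 254811200`.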

#### In-place restore customizations

The same customization options for [RonDB and OpenSearch](#customizations) backup IDs apply to in-place restore. You can override individual service backup IDs while keeping the global backup ID for HopsFS.