Vault
Recover from lost quorum
With Integrated Storage, Raft quorum maintenance is a consideration for configuring and operating your Vault environment. A Vault cluster permanently loses quorum when there is no way to recover enough servers to reach consensus and elect a leader. Without a quorum of cluster servers, Vault can no longer perform read and write operations.
The cluster quorum is dynamically updated when new servers join the cluster.
Vault calculates quorum with the formula (n+1)/2, where n is the number of
servers in the cluster. For example, a 3-server cluster needs at least 2
servers operational to function properly: (3+1)/2 = 2. Specifically, 2 servers
must always be active to perform read and write operations. (Review the
deployment table.)
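As a rough illustration of the arithmetic above, the following shell sketch prints the quorum size and failure tolerance for a few cluster sizes. The function names quorum and failure_tolerance are hypothetical helpers, not Vault commands. The sketch computes a strict majority as floor(n/2) + 1, which agrees with (n+1)/2 for odd cluster sizes; for an even size such as 4, a majority is 3.

```shell
# Hypothetical helpers (not part of the Vault CLI).

# Quorum size for an n-server cluster: a strict majority, floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of servers that can fail before the cluster loses quorum.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5 7; do
  echo "servers=$n quorum=$(quorum "$n") failure_tolerance=$(failure_tolerance "$n")"
done
```

A 3-server cluster tolerates the loss of one server, which is why losing two of three servers leaves the cluster without quorum.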
Note
There is an exception to this rule if you use the -non-voter option while joining the cluster. This feature is available only in Vault Enterprise editions.
Scenario overview
When two of the three servers encounter an outage, the cluster loses quorum and becomes inoperable.
Although one of the servers is fully functioning, the cluster won't be able to process read or write requests.
Example:
$ vault operator raft list-peers
No raft cluster configuration found
$ vault kv get kv/apikey
nil response from pre-flight request
In this tutorial, you will recover from the permanent loss of two of three Vault servers by converting the cluster into a single-server cluster.
The last server must be fully operational to complete this procedure.
Note
Sometimes Vault loses quorum because autopilot marks servers as unhealthy even though the Vault service is still running on them. On the unhealthy servers, you must stop the Vault service before running the peers.json procedure.
In a 5-server cluster, or when non-voters are present, you must stop the other healthy servers before performing the peers.json recovery.
Locate the storage directory
On the healthy Vault server, locate the Raft storage directory. To discover the
location of the directory, review your Vault configuration file. The storage
stanza will contain the path to the directory.
Example:
vault-config.hcl
storage "raft" {
  path    = "/vault/data"
  node_id = "vault_1"
}
listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable     = true
}
api_addr      = "http://192.0.2.1:8200"
cluster_addr  = "http://10.0.101.22:8201"
disable_mlock = true
ui            = true
In this example, path is the file system path where Vault stores data, and
node_id is the identifier for the server in the Raft cluster. The example
node_id is vault_1.
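If you prefer to read the path from the configuration file with a script, the lookup can be sketched as follows. The function name raft_storage_path is a hypothetical helper, and the sketch assumes the storage stanza is written across multiple lines, as in the example above, with the path value in double quotes.

```shell
# Hypothetical helper: print the path value from the storage "raft" stanza
# of a Vault configuration file. Assumes a multi-line stanza and a
# double-quoted path value.
raft_storage_path() {
  awk '
    /^[ \t]*storage[ \t]+"raft"/ { in_raft = 1; next }   # stanza opens
    in_raft && /^[ \t]*[}]/      { in_raft = 0 }         # stanza closes
    in_raft && /^[ \t]*path[ \t]*=/ {
      # Print the text between the first pair of double quotes.
      if (match($0, /"[^"]*"/)) print substr($0, RSTART + 1, RLENGTH - 2)
    }
  ' "$1"
}

# Example usage (the file location is an assumption):
#   raft_storage_path /etc/vault.d/vault-config.hcl
```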
Create the peers.json file
Inside the storage directory (/vault/data), there is a folder named raft.
vault
└── data
    ├── raft
    │   ├── raft.db
    │   └── snapshots
    └── vault.db
To enable the single, remaining Vault server to reach quorum and elect itself as
the leader, create a raft/peers.json file that holds the server information.
The file format is a JSON array containing the server ID, address:port, and
suffrage information of the healthy Vault server (for example, vault_1).
Example:
$ cat > /vault/data/raft/peers.json << EOF
[
  {
    "id": "vault_1",
    "address": "10.0.101.22:8201",
    "non_voter": false
  }
]
EOF
id (string: <required>) - Specifies the server ID of the server.
address (string: <required>) - Specifies the host and port of the server. The port is the server's cluster port.
non_voter (bool: <false>) - This controls whether the server is a non-voter.
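When the file has to be produced for different servers, generating it from arguments avoids hand-editing JSON. The following is a minimal sketch: make_peers_json is a hypothetical helper, it writes every listed server as a voter, and it assumes the IDs and addresses contain no quotes or backslashes that would need JSON escaping.

```shell
# Hypothetical helper: emit a peers.json array from id=address arguments.
# Every server is written as a voter ("non_voter": false).
make_peers_json() {
  printf '[\n'
  first=1
  for peer in "$@"; do
    id=${peer%%=*}        # text before the first "="
    address=${peer#*=}    # text after the first "="
    [ "$first" -eq 1 ] || printf ',\n'
    first=0
    printf '  {\n    "id": "%s",\n    "address": "%s",\n    "non_voter": false\n  }' \
      "$id" "$address"
  done
  printf '\n]\n'
}

# Example usage, with the values from this tutorial:
#   make_peers_json vault_1=10.0.101.22:8201 > /vault/data/raft/peers.json
```

Passing several id=address arguments produces one array entry per server, in the order given.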
Restart Vault
Restart the Vault process to enable Vault to load the new peers.json file.
$ sudo systemctl restart vault
Note
If you use systemd, sending a SIGHUP signal will not work. You must restart the service.
Verify success
The recovery procedure is successful when Vault starts up and displays these messages in the system logs.
...snip...
[INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
[INFO] storage.raft: raft recovery initiated: recovery_file=peers.json
[INFO] storage.raft: raft recovery found new config: config="{[{Voter vault_1 https://10.0.101.22:8201}]}"
[INFO] storage.raft: raft recovery deleted peers.json
...snip...
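To check for these messages without reading the log by eye, a filter along the following lines may help. The function name raft_recovery_succeeded is a hypothetical helper; it reads log lines on standard input and succeeds only if all three recovery messages shown above are present. The usage comment assumes the service logs to the systemd journal under a unit named vault.

```shell
# Hypothetical helper: read Vault log lines on stdin and succeed only if
# every raft recovery message from the example above appears.
raft_recovery_succeeded() {
  logs=$(cat)
  for msg in \
    'raft recovery initiated' \
    'raft recovery found new config' \
    'raft recovery deleted peers.json'
  do
    printf '%s\n' "$logs" | grep -qF "$msg" || return 1
  done
}

# Example usage (assumes a systemd unit named "vault"):
#   journalctl -u vault --no-pager | raft_recovery_succeeded && echo "recovery completed"
```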
Unseal Vault
If Vault is not configured to use auto-unseal, unseal Vault and then check the status.
Example:
$ vault operator unseal
Unseal Key (will be hidden):
$ vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.7.3
Storage Type             raft
Cluster Name             vault-cluster-4a1a40af
Cluster ID               d09df2c7-1d3e-f7d0-a9f7-93fadcc29110
HA Enabled               true
HA Cluster               https://10.0.101.22:8201
HA Mode                  active
Active Since             2021-07-20T00:07:32.215236307Z
Raft Committed Index     299
Raft Applied Index       299
View the peer list
You now have a cluster with one server that can reach quorum. Verify that there is just one server in the cluster with the vault operator raft list-peers command.
$ vault operator raft list-peers
Node       Address                     State     Voter
----       -------                     -----     -----
vault_1    https://10.0.101.22:8201    leader    true
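The same check can be scripted. The sketch below is a hypothetical helper named single_leader_cluster: it reads the list-peers table on standard input and succeeds only if the table lists exactly one peer and that peer is a voting leader. It assumes the four-column layout shown above, with a header row followed by a dashed separator row.

```shell
# Hypothetical helper: read the list-peers table on stdin and succeed only
# if it lists exactly one peer, and that peer is a voting leader.
# Assumes four whitespace-separated columns after two header rows.
single_leader_cluster() {
  awk '
    NR > 2 && NF >= 4 {
      peers++
      if ($3 == "leader" && $4 == "true") leaders++
    }
    END { exit !(peers == 1 && leaders == 1) }
  '
}

# Example usage:
#   vault operator raft list-peers | single_leader_cluster && echo "single-server cluster"
```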
Next steps
In this tutorial, you recovered from the loss of quorum by converting a 3-server
cluster into a single-server cluster using the peers.json file. The peers.json
file enabled you to manually overwrite the Raft peer list with the one remaining
server, which allowed that server to reach quorum and complete a leader
election.
If the failed servers are recoverable, the best option is to bring them back
online and have them reconnect to the cluster using the same host addresses.
This will return the cluster to a fully healthy state. In such an event, the
raft/peers.json file should contain the server ID, address:port, and suffrage
information of each Vault server you wish to be in the cluster.
[
  {
    "id": "server1",
    "address": "server1.vault.local:8201",
    "non_voter": false
  },
  {
    "id": "server2",
    "address": "server2.vault.local:8201",
    "non_voter": false
  },
  {
    "id": "server3",
    "address": "server3.vault.local:8201",
    "non_voter": false
  }
]
See the Outage Recovery documentation for more detail.