How to replace an etcd node
Avoid etcd issues when replacing an AWS instance
Recently ngineered upgraded our CoreOS instances running on AWS, replacing them with the latest and greatest version. Whilst doing this we noticed that etcd did not rejoin the cluster when we rebuilt an AWS instance. We thought we'd share our findings in case you hit the same issue.
We have a clustered etcd setup consisting of multiple etcd servers. We terminated one of the instances in order to replace it with a newer AMI. As we control the startup and configuration through cloud-config, these servers are pretty immutable and can be replaced easily. However, etcd did not agree with this and refused to start, with some pretty cryptic messages.
On digging into the messages, it seemed that the server we had rebuilt (let's call it server00) thought it was already a member of the cluster, but the other members didn't agree.
When we logged onto the other etcd servers (that were still working) and used the etcdctl tool, we got a better idea of what was happening.
The rebuilt server, server00, still appeared in the member list:
etcdctl member list
However, it had a UID assigned to it that all the other members agreed was the identity of server00. Because we had rebuilt server00 from scratch, including new EBS disks, it had no idea what its UID should be, so it generated a new one and then tried to join its peers in the cluster. This was, of course, rejected as a server mismatch. etcd is designed to fail in this situation and that's exactly what it did!
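If you want to pick the stale UID out of the member list without copying it by hand, a small helper along these lines may do. It is only a sketch: the function name member_id_for_name is ours, and it assumes the etcd 2.x style of output where each line looks like "<id>: name=<name> peerURLs=... clientURLs=...". Check what your version of etcdctl prints before relying on it.

```shell
# Hypothetical helper: print the member ID for a given member name.
# Reads `etcdctl member list` output from stdin.
# ASSUMED line format (etcd 2.x style):
#   <id>: name=<name> peerURLs=<url> clientURLs=<url>
member_id_for_name() {
  awk -v want="name=$1" '
    {
      # Field 1 is "<id>:", so the name=... field is somewhere after it.
      for (i = 2; i <= NF; i++) {
        if ($i == want) {
          id = $1
          sub(/:$/, "", id)   # drop the trailing colon after the ID
          print id
        }
      }
    }'
}
```

On a working member this would be used as: etcdctl member list | member_id_for_name server00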
The fix was pretty simple. First we had to log into a working server in the rest of the cluster and remove server00 from the member list:
etcdctl member remove <UID>
This frees up the cluster to let the new server00 join, but we still needed to tell the cluster it could by issuing the add command:
etcdctl member add server00 http://220.127.116.11:2380
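The two commands above can be wrapped into one step so they always run in the right order. This is a minimal sketch rather than what we actually ran: replace_member is a made-up function name, and the ID, name and peer URL are whatever applies to your cluster.

```shell
# Hypothetical wrapper: remove the stale member entry, then register
# the rebuilt node under the same name. Run it on a healthy member.
#   $1 = stale member ID (from `etcdctl member list`)
#   $2 = member name, e.g. server00
#   $3 = peer URL of the rebuilt node
replace_member() {
  old_id=$1
  name=$2
  peer_url=$3

  # Stop here if the remove fails, so a second entry is never
  # added while the stale one is still registered.
  etcdctl member remove "$old_id" || return 1

  etcdctl member add "$name" "$peer_url"
}
```

It would be called as: replace_member <UID> server00 <peer URL>. As far as we know, in etcd 2.x the add command also prints suggested ETCD_NAME, ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_STATE values for the new member, which may be worth comparing against your cloud-config.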
If you follow the logs on server00 you'll then see everything spring into life. You can confirm this with the commands:
etcdctl member list
etcdctl cluster-health
This should show you everything is now back as a member of the cluster and working fine.
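If the replacement is scripted, the health check can be polled rather than run by hand. The sketch below assumes that cluster-health prints a final line reading "cluster is healthy", as etcd 2.x does; wait_for_healthy and its defaults are our own invention.

```shell
# Hypothetical helper: poll `etcdctl cluster-health` until it reports
# a healthy cluster, or give up after a number of attempts.
#   $1 = maximum attempts (default 30)
#   $2 = seconds to sleep between attempts (default 2)
# ASSUMES the etcd 2.x summary line "cluster is healthy".
wait_for_healthy() {
  tries=${1:-30}
  delay=${2:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if etcdctl cluster-health 2>/dev/null | grep -q '^cluster is healthy$'; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1   # never became healthy within the allowed attempts
}
```

For example, wait_for_healthy 60 5 would check every five seconds for up to five minutes and return non-zero if the cluster never reports healthy.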
The lesson learned from this is that you need to issue the member remove command before you try to bring up the new server.
I hope that helps solve some issues that others may have.