Snapshotting your Firecracker microVM

Published: at 12:00 AM

Articles in this Series


In the previous article, we set up guest connectivity to the internet. In this article, we’ll set up microVM snapshots. A snapshot lets you initialize a microVM up to a certain point and then save its entire state to disk at that point. Later, you can resume a microVM from the snapshot instead of booting it from scratch, which leads to much shorter cold start times (as long as initialization was the dominant part of your startup latency).

What a Firecracker snapshot is

A snapshot is just a frozen copy of everything a microVM needs to keep running: its guest memory plus the internal state of the VM itself. Firecracker splits this into two files:

  1. A memory file — a dump of the guest’s RAM.
  2. A microVM state file — everything else: the vCPU registers, the state of the emulated devices, and Firecracker’s bookkeeping.

Notice what’s not in there: the root filesystem and the kernel image. Those files are referenced in the snapshot, but data is loaded from them on demand, so you’ll have to provide them again when you resume the microVM.
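
Because the snapshot leaves those files out, a quick pre-flight check before resuming can save a confusing load error. The following is a minimal sketch and not part of Firecracker: missing_artifacts is a hypothetical helper, and which files you pass to it is up to you.

```python
from pathlib import Path


def missing_artifacts(paths):
    """Return the entries of `paths` that are not existing regular files.

    `paths` is whatever set of files you expect a resume to need, for
    example the memory file, the microVM state file, and the rootfs image.
    """
    return [str(p) for p in map(Path, paths) if not p.is_file()]
```

An empty list means every file is in place. Anything else is the list of paths to fix before calling the load API.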

Demonstration

To demonstrate snapshots in action, we’re going to create a small Flask server that responds to GET /count by returning how many times it’s been called. The counter lives in memory, so it’ll be preserved across snapshots.

# root/counter.py
from flask import Flask, jsonify
app = Flask(__name__)
count = 0

@app.get("/count")
def get_count():
    global count
    count += 1
    return jsonify(count=count)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

We’ll bake this server into the rootfs so that it starts automatically when the microVM boots.

# /etc/init.d/counter-server
#!/sbin/openrc-run

name="counter-server"
description="In-memory hit counter (Flask) for the snapshot demo"
command="/usr/bin/python3"
command_args="/root/counter.py"
command_background=true
pidfile="/run/counter-server.pid"

depend() {
    need guest-networking
}

The command_background=true line tells OpenRC to daemonize the process for us, and depend() makes sure networking is up before the server starts.

Finally, we update the Dockerfile to install Flask and register the service in the default runlevel. counter.py lives in the root/ directory that’s already copied into the image, so there’s no new COPY to add:

  # Download dependencies
  RUN apk add --update --no-cache \
      openrc \
-     util-linux
+     util-linux \
+     python3 \
+     py3-flask

  RUN rc-update add boot-arg-logger \
-     && rc-update add guest-networking
+     && rc-update add guest-networking \
+     && rc-update add counter-server

Once the rootfs is rebuilt with this service registered, the guest comes up at 10.200.0.2:8000 and we can curl our counting service.

Taking a snapshot

Let’s start by booting Firecracker with our new rootfs:

rm -f /tmp/firecracker.socket \
&& /tmp/firecracker \
    --api-sock /tmp/firecracker.socket \
    --config-file ./vmconfig.json

Once it’s up, we can reach the counter from the host at 10.200.0.2:8000. Let’s hit it a few times so it has some in-memory state worth preserving:

$ curl -s http://10.200.0.2:8000/count
{"count":1}
$ curl -s http://10.200.0.2:8000/count
{"count":2}
$ curl -s http://10.200.0.2:8000/count
{"count":3}

The counter is now at 3, living in the Flask process’s memory. This is the state the snapshot will capture.

Create the snapshot

To take a snapshot, we first have to pause the microVM. Then we can create the snapshot via the PUT /snapshot/create API. The request body contains the snapshot type (for now, we’ll create a Full snapshot) and the paths for the microVM state file (snapshot_path) and the memory file (mem_file_path).

curl -s -X PATCH --unix-socket /tmp/firecracker.socket \
    http://localhost/vm \
    -H 'Content-Type: application/json' \
    -d '{"state": "Paused"}'

curl -s -X PUT --unix-socket /tmp/firecracker.socket \
    http://localhost/snapshot/create \
    -H 'Content-Type: application/json' \
    -d '{
        "snapshot_type": "Full",
        "snapshot_path": "/tmp/snapshot.file",
        "mem_file_path": "/tmp/mem.file"
    }'

This leaves us with our two snapshot artifacts on disk:

$ ls -lh /tmp/snapshot.file /tmp/mem.file
-rw-rw-r-- 1 root root 128M /tmp/mem.file # Firecracker default memory size is 128M
-rw-rw-r-- 1 root root  15K /tmp/snapshot.file
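
The same two API calls can be scripted without curl. The following is a minimal sketch using only Python’s standard library, assuming the API socket and file paths used above. UnixHTTPConnection, api_call, and pause_and_snapshot are illustrative names introduced here, not part of Firecracker or any client library. The endpoints and JSON fields are the ones from the curl commands.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that dials a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=10.0):
        # "localhost" only fills the Host header; the socket path is what matters.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def pause_body():
    """Request body for PATCH /vm that pauses the microVM."""
    return {"state": "Paused"}


def snapshot_create_body(snapshot_path, mem_file_path, snapshot_type="Full"):
    """Request body for PUT /snapshot/create."""
    return {
        "snapshot_type": snapshot_type,
        "snapshot_path": snapshot_path,
        "mem_file_path": mem_file_path,
    }


def api_call(socket_path, method, path, body):
    """Send one JSON request to the API socket; return (status, response text)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(
            method,
            path,
            body=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()


def pause_and_snapshot(socket_path, snapshot_path, mem_file_path):
    """Pause the microVM, then create a snapshot. Raises on a non-2xx reply."""
    steps = [
        ("PATCH", "/vm", pause_body()),
        ("PUT", "/snapshot/create",
         snapshot_create_body(snapshot_path, mem_file_path)),
    ]
    for method, path, body in steps:
        status, text = api_call(socket_path, method, path, body)
        if not 200 <= status < 300:
            raise RuntimeError(f"{method} {path} failed with {status}: {text}")
```

With the microVM from this section running, pause_and_snapshot("/tmp/firecracker.socket", "/tmp/snapshot.file", "/tmp/mem.file") would issue the same requests as the two curl commands. Checking the status code after each step matters here because a failed pause would otherwise go unnoticed until the snapshot request fails.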

Restoring from a snapshot

Now let’s try restoring a new microVM from that snapshot. First, we’ll kill the existing Firecracker process:

pkill firecracker

To restore, we start a fresh Firecracker process pointed at a new API socket. We don’t need to pass the configuration anymore because it lives inside the microVM state file.

rm -f /tmp/restored.socket \
&& /tmp/firecracker --api-sock /tmp/restored.socket

Then we load the snapshot with a PUT to /snapshot/load, handing it the two files we created earlier and asking it to resume the microVM:

curl -s -X PUT --unix-socket /tmp/restored.socket \
    http://localhost/snapshot/load \
    -H 'Content-Type: application/json' \
    -d '{
        "snapshot_path": "/tmp/snapshot.file",
        "mem_backend": {
            "backend_type": "File",
            "backend_path": "/tmp/mem.file"
        },
        "resume_vm": true
    }'

If we hit the counter again, we’ll see that state was persisted across microVMs!

$ curl -s http://10.200.0.2:8000/count
{"count":4}

Resuming another microVM

What’s better than one resumed microVM? Two! Let’s try to spin up another one from the same snapshot:

rm -f /tmp/restored-2.socket \
&& /tmp/firecracker --api-sock /tmp/restored-2.socket

curl -s -X PUT --unix-socket /tmp/restored-2.socket \
    http://localhost/snapshot/load \
    -H 'Content-Type: application/json' \
    -d '{
        "snapshot_path": "/tmp/snapshot.file",
        "mem_backend": {
            "backend_type": "File",
            "backend_path": "/tmp/mem.file"
        },
        "resume_vm": true
    }'

{
    "fault_message": "Load snapshot error: ... Failed to restore devices: Error restoring MMIO devices: Net: Failed to create a network device: Open tap device failed: Error while creating ifreq structure: Resource busy (os error 16). Invalid TUN/TAP Backend provided by tap0. ..."
}

The Firecracker process exits with an error before the microVM ever resumes. The problem is that the snapshot captures the microVM’s device configuration, including the network setup. The first clone is already attached to the tap0 device, so the second clone can’t open it.

The solution is to run each microVM inside its own network namespace. That gives each microVM a private network stack in which everything the snapshot expects (the tap0 device, the host-side address 10.200.0.1/24, and the guest address 10.200.0.2) can exist independently of every other clone. The clones can’t see each other, so they can’t collide.

Running clones in network namespaces

Let’s recreate the expected network topology inside a network namespace.

setup_ns() {
    local ns="$1"
    sudo ip netns add "$ns"
    sudo ip netns exec "$ns" ip tuntap add dev tap0 mode tap
    sudo ip netns exec "$ns" ip addr add 10.200.0.1/24 dev tap0
    sudo ip netns exec "$ns" ip link set dev tap0 up
    sudo ip netns exec "$ns" ip link set dev lo up
}

setup_ns clone0
setup_ns clone1

Now we resume the snapshot inside a namespace by prefixing the Firecracker command with ip netns exec.

resume() {
    local ns="$1" socket="$2"
    sudo ip netns exec "$ns" /tmp/firecracker --api-sock "$socket" &
    sleep 1
    sudo curl -s -X PUT --unix-socket "$socket" \
        http://localhost/snapshot/load \
        -H 'Content-Type: application/json' \
        -d '{
            "snapshot_path": "/tmp/snapshot.file",
            "mem_backend": {
                "backend_type": "File",
                "backend_path": "/tmp/mem.file"
            },
            "resume_vm": true
        }'
}

resume clone0 /tmp/clone0.socket
resume clone1 /tmp/clone1.socket
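
The sleep 1 in resume is a guess at how long Firecracker needs to create its API socket. A hedged alternative is to poll until the socket accepts connections. The sketch below uses only Python’s standard library. wait_for_socket is a hypothetical helper introduced here, and the timeout and interval defaults are arbitrary.

```python
import socket
import time


def wait_for_socket(path, timeout=5.0, interval=0.05):
    """Poll until a Unix socket at `path` accepts a connection.

    Returns True once a connection succeeds, or False if `timeout`
    seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            probe.connect(path)
            return True
        except OSError:
            # Socket file not created yet, or nothing is listening on it yet.
            pass
        finally:
            probe.close()
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_socket("/tmp/clone0.socket") before sending the load request would replace the fixed delay. Since Firecracker is started with sudo in resume, the caller needs enough privileges to connect to the socket, just as the curl call there is run with sudo.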

Each clone resumed from the same snapshot, so they both pick up the counter at the same starting value. But they’re now completely independent — hitting one doesn’t touch the other. If we hammer clone0 while leaving clone1 alone, their counters diverge:

$ sudo ip netns exec clone0 curl -s http://10.200.0.2:8000/count
{"count":4}
$ sudo ip netns exec clone0 curl -s http://10.200.0.2:8000/count
{"count":5}
$ sudo ip netns exec clone0 curl -s http://10.200.0.2:8000/count
{"count":6}

$ sudo ip netns exec clone1 curl -s http://10.200.0.2:8000/count
{"count":4}

clone0 is at 6 while clone1 is still at 4. We forked one snapshot into two microVMs that now have their own separate memory!

But both microVMs have the same IP address, so how can that work? What happens if we try addressing them from the root namespace?

$ curl -s --max-time 3 http://10.200.0.2:8000/count
# (no response)

The namespaces solved our device collision problem, but introduced a new one: there’s no path between the clone namespaces and the host’s root namespace.

Reaching the microVMs from the root namespace

To talk to a clone from the root namespace, we need a wire between the two namespaces. That’s what a veth pair is used for.

We’ll give each clone its own /29 block out of 172.16.0.0. A /29 contains 8 addresses. The lowest (the network address) and the highest (the broadcast address) are unusable, so each block has 6 usable IP addresses. Each veth pair takes up two of them (one per end), so a block has room for up to three veth pairs. This can be useful if your microVM has different interfaces for different classes of traffic.

Clone   Subnet         Network     Root side   Namespace side  Broadcast
clone0  172.16.0.0/29  172.16.0.0  172.16.0.1  172.16.0.2      172.16.0.7
clone1  172.16.0.8/29  172.16.0.8  172.16.0.9  172.16.0.10     172.16.0.15

We chose 172.16.0.0 to keep these addresses separate from the existing 10.x range. Any private range works as long as it doesn’t overlap with the others.
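
The addressing scheme is mechanical enough to compute rather than write out by hand. The following sketch uses Python’s standard ipaddress module to derive one row of the scheme from a clone’s index. clone_addresses and its dictionary keys are names made up for this illustration, and the function assumes the base address is aligned to the block size.

```python
import ipaddress


def clone_addresses(index, base="172.16.0.0", prefix=29):
    """Return the block for clone number `index` and its notable addresses."""
    block_size = 2 ** (32 - prefix)  # a /29 holds 8 addresses
    start = ipaddress.IPv4Address(base) + index * block_size
    net = ipaddress.IPv4Network((start, prefix))
    hosts = list(net.hosts())  # excludes the network and broadcast addresses
    return {
        "subnet": str(net),
        "network": str(net.network_address),
        "root_side": str(hosts[0]),       # veth end left in the root namespace
        "namespace_side": str(hosts[1]),  # veth end moved into the clone's namespace
        "broadcast": str(net.broadcast_address),
        "usable_hosts": len(hosts),
    }
```

For example, clone_addresses(1) yields the 172.16.0.8/29 block with 172.16.0.9 on the root side and 172.16.0.10 on the namespace side. Deriving addresses from an index keeps the blocks from overlapping as the number of clones grows.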

By itself, just assigning the IP addresses won’t work because the microVMs are still listening on 10.200.0.2. To solve this, we add a DNAT rule inside the namespace that rewrites the destination of any packet bound for the veth address (172.16.x.x) to the guest’s 10.200.0.2.

wire() {
    local ns="$1" root_ip="$2" ns_ip="$3"

    # Create the veth pair and move one end into the namespace
    sudo ip link add veth-"$ns" type veth peer name veth-peer
    sudo ip link set veth-peer netns "$ns"

    # Address and bring up both ends of the cable
    sudo ip addr add "$root_ip"/29 dev veth-"$ns"
    sudo ip link set veth-"$ns" up
    sudo ip netns exec "$ns" ip addr add "$ns_ip"/29 dev veth-peer
    sudo ip netns exec "$ns" ip link set veth-peer up

    # Rewrite traffic for the veth address to the guest behind it
    sudo ip netns exec "$ns" sysctl -w net.ipv4.ip_forward=1
    sudo ip netns exec "$ns" iptables -t nat -A PREROUTING \
        -d "$ns_ip" -j DNAT --to-destination 10.200.0.2
}

wire clone0 172.16.0.1 172.16.0.2
wire clone1 172.16.0.9 172.16.0.10

Now each clone is reachable from the root namespace at its own address — no ip netns exec required:

$ curl -s http://172.16.0.2:8000/count
{"count":7}
$ curl -s http://172.16.0.10:8000/count
{"count":5}

Wrapping up

We can now snapshot a running microVM and resume multiple concurrent clones from a single snapshot, each isolated in its own network namespace and reachable from the host.

You can view code samples for this article on my GitHub.
