User namespace and kernel capability

09/01/2021

Blog, Kernel Dig

Much of the kernel's attack surface is reachable only by an already-privileged user. To keep unprivileged users away from such features, the kernel generally relies on capability checks. Ubuntu enables unprivileged user namespaces by default, which exposes more of this attack surface to kernel exploits.

$ sudo sysctl kernel.unprivileged_userns_clone
kernel.unprivileged_userns_clone = 1
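
The same setting can be read from a program before attempting to unshare. The snippet below is a minimal sketch; the function name is made up for illustration, and the sysctl comes from a Debian/Ubuntu kernel patch, so the file may be missing on other distributions.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns the value of the sysctl (normally 0 or 1),
 * or -1 if the file is missing or cannot be parsed.
 */
int unprivileged_userns_clone(void)
{
	FILE *f = fopen("/proc/sys/kernel/unprivileged_userns_clone", "r");
	int value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

A return value of -1 only means the knob is absent. It does not by itself say whether unprivileged user namespaces are allowed.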

Nowadays many kernel subsystems are guarded by capabilities such as CAP_NET_ADMIN and CAP_NET_RAW. To trigger a vulnerability that sits behind such a check, one way is to have root grant the required capabilities to the exploit binary.

sudo setcap cap_net_raw,cap_net_admin=eip ./exp

Another way is to create a user namespace, which takes only a few calls.

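#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Helper (not shown): writes a printf-style formatted string to a file.
 * The signature is inferred from the calls below. */
bool write_file(const char *file, const char *what, ...);

void setup_sandbox(void) {
	/* Remember the IDs we have outside the namespace for the mappings. */
	int real_uid = getuid();
	int real_gid = getgid();

	if (unshare(CLONE_NEWUSER) != 0) {
		perror("[-] unshare(CLONE_NEWUSER)");
		exit(EXIT_FAILURE);
	}

	if (unshare(CLONE_NEWNET) != 0) {
		perror("[-] unshare(CLONE_NEWNET)");
		exit(EXIT_FAILURE);
	}

	/* setgroups must be denied before an unprivileged gid_map write. */
	if (!write_file("/proc/self/setgroups", "deny")) {
		perror("[-] write_file(/proc/self/setgroups)");
		exit(EXIT_FAILURE);
	}
	/* Map our real uid/gid to 0 inside the new namespace. */
	if (!write_file("/proc/self/uid_map", "0 %d 1\n", real_uid)) {
		perror("[-] write_file(/proc/self/uid_map)");
		exit(EXIT_FAILURE);
	}
	if (!write_file("/proc/self/gid_map", "0 %d 1\n", real_gid)) {
		perror("[-] write_file(/proc/self/gid_map)");
		exit(EXIT_FAILURE);
	}
}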
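
setup_sandbox() relies on a write_file() helper that is not shown above. Here is a minimal sketch of such a helper, assuming it takes a printf-style format and returns true on success. The signature is inferred from the calls, not taken from any particular exploit source.

```c
#include <fcntl.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed helper: format a string and write it to an existing file. */
bool write_file(const char *file, const char *what, ...)
{
	char buf[1024];
	va_list args;
	int len, fd;

	va_start(args, what);
	len = vsnprintf(buf, sizeof(buf), what, args);
	va_end(args);
	/* Reject formatting errors and output that did not fit in buf. */
	if (len < 0 || (size_t)len >= sizeof(buf))
		return false;

	fd = open(file, O_WRONLY);
	if (fd == -1)
		return false;
	/* The /proc map files expect the whole mapping in a single write. */
	if (write(fd, buf, len) != len) {
		close(fd);
		return false;
	}
	close(fd);
	return true;
}
```

The file is opened without O_CREAT on purpose: the targets are existing /proc entries, so a missing file is reported as a failure instead of being created.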

Creating a user namespace invokes create_user_ns() in the kernel, which calls set_cred_user_ns() to give the creating task's credentials CAP_FULL_SET, bound to the new user namespace. This is why a process inside a user namespace can pass some capability checks.

int create_user_ns(struct cred *new)
{
	...
	set_cred_user_ns(new, ns);
	...
}


static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
{
	/* Start with the same capabilities as init but useless for doing
	 * anything as the capabilities are bound to the new user namespace.
	 */
	cred->securebits = SECUREBITS_DEFAULT;
	cred->cap_inheritable = CAP_EMPTY_SET;
	cred->cap_permitted = CAP_FULL_SET;
	cred->cap_effective = CAP_FULL_SET;
	cred->cap_ambient = CAP_EMPTY_SET;
	cred->cap_bset = CAP_FULL_SET;
#ifdef CONFIG_KEYS
	key_put(cred->request_key_auth);
	cred->request_key_auth = NULL;
#endif
	/* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */
	cred->user_ns = user_ns;
}
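
The effect of set_cred_user_ns() can be observed from user space through the CapEff line in /proc/self/status. The following is a minimal sketch with a made-up helper name that parses that line. Calling it before and after unshare(CLONE_NEWUSER) should show the effective mask going from empty (for an ordinary unprivileged process) to a full set.

```c
#include <stdio.h>

/*
 * Hypothetical helper: stores the effective capability mask of the
 * calling process in *out. Returns 0 on success, -1 on failure.
 */
int read_cap_eff(unsigned long long *out)
{
	char line[256];
	int ret = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* The line looks like "CapEff:\t<hex mask>". */
		if (sscanf(line, "CapEff: %llx", out) == 1) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

Keep in mind that the mask only has meaning relative to the user namespace the credentials are bound to, as the kernel comment above notes.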

Unfortunately, not every capability check in the kernel can be bypassed with a user namespace. In other words, not every check is made against the current user namespace. While manually inspecting several CVE exploits, I noticed that CVE-2017-7184 bypasses the check netlink_net_capable(skb, CAP_NET_ADMIN). As the code below shows, netlink_net_capable() uses net->user_ns, the user namespace that owns the socket's network namespace. CVE-2017-7308 uses a namespace to bypass ns_capable(sock_net(sock->sk)->user_ns, CAP_NET_ADMIN), which is basically the same as CVE-2017-7184. However, some checks do not involve the current user namespace at all: capable(), for example, only checks against the initial user namespace. Such a check passes only if the caller holds the capability in the initial user namespace, so an unprivileged user cannot get past it by creating a new namespace.

bool netlink_net_capable(const struct sk_buff *skb, int cap)
{
	return netlink_ns_capable(skb, sock_net(skb->sk)->user_ns, cap);
}

bool capable(int cap)
{
	return ns_capable(&init_user_ns, cap);
}
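
To make the difference concrete, here is a toy user-space model of the namespace walk behind ns_capable(). It is a simplified sketch, not kernel code: the toy_ structures and function are invented for illustration, and LSM hooks and other details are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for kernel structures; names are illustrative only. */
struct toy_user_ns {
	struct toy_user_ns *parent; /* NULL for the initial namespace */
	int level;                  /* 0 for the initial namespace */
	unsigned int owner;         /* uid that created the namespace */
};

struct toy_cred {
	struct toy_user_ns *user_ns; /* namespace the capabilities are bound to */
	unsigned int euid;
	uint64_t cap_effective;      /* bit N set means capability N is held */
};

#define TOY_CAP_NET_ADMIN 12 /* same numeric value as CAP_NET_ADMIN */

/* Does `cred` hold `cap` with respect to the namespace `target`? */
bool toy_ns_capable(const struct toy_cred *cred,
		    const struct toy_user_ns *target, int cap)
{
	const struct toy_user_ns *ns = target;

	for (;;) {
		/* Same namespace: the effective set decides. */
		if (ns == cred->user_ns)
			return (cred->cap_effective >> cap) & 1;
		/* Target is not a descendant of the cred's namespace. */
		if (ns->level <= cred->user_ns->level)
			return false;
		/* The creator of a child namespace has all capabilities in it. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
		ns = ns->parent;
	}
}
```

In this model, a credential with a full effective set bound to a child namespace passes a check whose target is that child, which corresponds to the net->user_ns case. It fails a check whose target is the initial namespace, which corresponds to capable().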

