Notes on eBPF

Prerequisites

Software

A Linux machine
BCC

BCC's repo

References

"Learning eBPF" by Liz Rice

Header file that defines some useful constants

kernel.org's documentation

eBPF's in-depth explanation

eBPF's portal website

BCC's documentation

Getting our hands dirty

What is eBPF

It's a mechanism that lets one dynamically load and run a piece of VERIFIED code inside the kernel, without having to change the kernel source code or load kernel modules.

This means it:

Eases the difficulty of modifying kernel code or writing a kernel module.
Removes the time spent rebuilding the kernel or kernel modules.
Leaves barely any chance of running faulty code, because the verifier rejects unsafe programs before they are loaded.

eBPF programs are event-driven: they run when the hook they are attached to is triggered.

Frontends

Many eBPF frontends exist:

BCC
bpftrace
eBPF Go Library
libbpf C/C++ Library

I chose BCC because I already know Python, and the examples in the book I'm following use BCC.

BPF Maps

Maps are data structures that let eBPF programs share data with each other and with userland programs.

Helper functions

Since eBPF programs are not allowed to call arbitrary kernel functions, the kernel provides a set of helper functions for them to call instead.

Helper functions improve safety and make eBPF code more kernel-version agnostic, since existing helpers keep a stable interface across kernel versions.

First example

Source code

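from bcc import BPF

# The eBPF program, written in C
prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello World!");
    return 0;
}
"""
b = BPF(text=prog)                               # compile the C code into eBPF bytecode
syscall = b.get_syscall_fnname("execve")         # kernel function name for execve
b.attach_kprobe(event=syscall, fn_name="hello")  # run hello() whenever execve is called
b.trace_print()                                  # print the output of bpf_trace_printk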

Explanation

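Variable prog: an eBPF program written in C that calls the helper function bpf_trace_printk().
Variable b: a BPF object; the C code is compiled into eBPF bytecode at this stage.
Variable syscall: the name of the kernel function that implements the execve syscall, which the program gets attached to. get_syscall_fnname() is used for compatibility, because this name differs across kernel versions and architectures (e.g. __x64_sys_execve on recent x86-64 kernels).
attach_kprobe(): attaches hello to a kprobe on that function, so hello runs every time execve is called.
trace_print(): reads the kernel's trace pipe and prints each line.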
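To make the compatibility point concrete, here is a rough pure-Python sketch of the idea behind resolving a syscall's kernel function name: try a list of architecture-specific prefixes and keep the first one whose symbol exists. The function resolve_syscall_fnname, the prefix list, and the fake symbol set are all hypothetical; as far as I understand, BCC does something similar by checking the kernel's symbol table, but this is not its actual code.

```python
# Hypothetical sketch: resolve a syscall name to a kernel symbol by trying prefixes.
# The prefix list and symbol set are illustrative assumptions, not BCC's code.

CANDIDATE_PREFIXES = ["sys_", "__x64_sys_", "__arm64_sys_"]


def resolve_syscall_fnname(name: str, kernel_symbols: set) -> str:
    """Return the first prefixed symbol that exists in kernel_symbols."""
    for prefix in CANDIDATE_PREFIXES:
        candidate = prefix + name
        if candidate in kernel_symbols:
            return candidate
    raise LookupError(f"no kernel symbol found for syscall {name!r}")


# Made-up symbols, standing in for what an x86-64 kernel might export
fake_symbols = {"__x64_sys_execve", "__x64_sys_openat", "do_exit"}
print(resolve_syscall_fnname("execve", fake_symbols))  # __x64_sys_execve
```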

Notes

This example demonstrates simple output.

If several eBPF programs use bpf_trace_printk at the same time, their output gets interleaved, because they all write to the same kernel trace pipe.
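Each trace pipe line carries the task name and PID, which helps tell interleaved output apart. The sketch below splits a line into fields, assuming the common layout `<task>-<pid> [<cpu>] <flags> <timestamp>: <event>: <message>`; the layout can vary with kernel version and tracing options, and the sample lines are made up. parse_trace_line is a hypothetical helper, not BCC's API.

```python
# Hypothetical sketch: split one trace pipe line into its fields.
# Assumed layout: "<task>-<pid> [<cpu>] <flags> <timestamp>: <event>: <message>"
import re

TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+"  # task name and PID, e.g. "bash-5412"
    r"\[(?P<cpu>\d+)\]\s+"               # CPU number in brackets
    r"(?P<flags>\S+)\s+"                 # flags column, e.g. "d..31"
    r"(?P<ts>\d+\.\d+):\s+"              # timestamp in seconds
    r"(?P<event>[^:]+):\s+"              # event name, e.g. "bpf_trace_printk"
    r"(?P<msg>.*)$"                      # the message text
)


def parse_trace_line(line: str) -> dict:
    match = TRACE_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized trace line: {line!r}")
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["cpu"] = int(fields["cpu"])
    fields["ts"] = float(fields["ts"])
    return fields


# Made-up lines from two different processes, interleaved in the pipe
sample = [
    "            bash-5412    [001] d..31 64362.262787: bpf_trace_printk: Hello World!",
    "   kworker/u8:2-99       [003] d..31 64362.262901: bpf_trace_printk: from another program",
]
for line in sample:
    f = parse_trace_line(line)
    print(f["task"], f["pid"], f["msg"])
```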

Second example

Source code

I omitted the parts that are the same as in the first example.

C part

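BPF_HASH(counter_table);  // hash map; key and value types default to u64
int hello(void *ctx) {
    // The lower 32 bits of the helper's return value hold the UID
    u64 uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
    u64 *p = counter_table.lookup(&uid);  // NULL if this UID has no entry yet
    u64 counter = 0;
    if (p != 0) {
        counter = *p;
    }
    counter++;
    counter_table.update(&uid, &counter);  // insert or overwrite the entry
    return 0;
}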

Python part

from time import sleep

# Print the per-UID counters every 2 seconds
while True:
    sleep(2)
    s = ""
    for k, v in b["counter_table"].items():
        s += f"ID {k.value}: {v.value}\t"
    print(s)

Explanation

C part

BPF_HASH(): a macro that creates a hash table map
bpf_get_current_uid_gid(): a helper function to obtain the user id and group id.
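bpf_get_current_uid_gid() packs both IDs into one 64-bit value: the GID in the upper 32 bits and the UID in the lower 32 bits, which is why the C code masks with 0xFFFFFFFF. The pure-Python sketch below mimics that layout; pack_uid_gid and split_uid_gid are hypothetical names for illustration.

```python
# Sketch of the 64-bit layout returned by bpf_get_current_uid_gid():
# upper 32 bits = GID, lower 32 bits = UID. Function names are hypothetical.

MASK_32 = 0xFFFFFFFF


def pack_uid_gid(uid: int, gid: int) -> int:
    """Combine a UID and GID the way the helper's return value is laid out."""
    return ((gid & MASK_32) << 32) | (uid & MASK_32)


def split_uid_gid(value: int) -> tuple:
    """Recover (uid, gid); the uid part is what `& 0xFFFFFFFF` extracts in C."""
    uid = value & MASK_32
    gid = value >> 32
    return uid, gid


packed = pack_uid_gid(uid=1000, gid=100)
print(hex(packed))            # 0x64000003e8
print(split_uid_gid(packed))  # (1000, 100)
```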

Notes

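This example shows how to use BPF maps.

Notice that counter_table.lookup() is not valid C. BCC rewrites this method-call syntax into the corresponding helper calls (such as bpf_map_lookup_elem()) before compiling.

Here are some of the map types.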

See /usr/include/linux/bpf.h for the full list.

Generic maps list:

BPF_MAP_TYPE_HASH
BPF_MAP_TYPE_ARRAY
BPF_MAP_TYPE_PERCPU_HASH
BPF_MAP_TYPE_PERCPU_ARRAY
BPF_MAP_TYPE_LRU_HASH
BPF_MAP_TYPE_LRU_PERCPU_HASH
BPF_MAP_TYPE_LPM_TRIE

What is PERCPU?

These are per-CPU variants: the kernel uses a separate block of memory for each CPU core's copy of the map.
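Because each CPU has its own copy, a userland reader generally has to combine the per-CPU values itself, for example by summing counters. The sketch below models a per-CPU counter map with plain Python lists (one slot per CPU); it simulates the idea only, is not BCC's API, and the class name and CPU count are assumptions.

```python
# Simulation of a per-CPU hash map: each key maps to one counter slot per CPU.
# Each CPU updates only its own slot, so updates from different CPUs don't contend.
from collections import defaultdict

NUM_CPUS = 4  # assumed CPU count for the example


class PerCpuCounter:
    def __init__(self, num_cpus: int) -> None:
        self.num_cpus = num_cpus
        self.slots = defaultdict(lambda: [0] * num_cpus)

    def increment(self, key, cpu: int) -> None:
        """What an eBPF program running on `cpu` would do: touch only its slot."""
        self.slots[key][cpu] += 1

    def total(self, key) -> int:
        """What userland does when reading: sum the values across all CPUs."""
        return sum(self.slots.get(key, [0] * self.num_cpus))


table = PerCpuCounter(NUM_CPUS)
for cpu in [0, 1, 1, 3, 3, 3]:
    table.increment("uid:1000", cpu)
print(table.slots["uid:1000"])  # [1, 2, 0, 3]
print(table.total("uid:1000"))  # 6
```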

What are LPM and TRIE?

LPM: longest prefix match.

Trie: a prefix tree. The name comes from the middle syllable of "retrieval".
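An LPM trie answers questions like "which stored prefix is the most specific one containing this IP address?", which is handy for routing or IP filtering. The sketch below is a minimal binary trie over IPv4 addresses in plain Python; it illustrates the data structure and the lookup rule, not the kernel's implementation, and the prefixes and values are made up.

```python
# Minimal binary trie for IPv4 longest-prefix match (illustration only).
import ipaddress


class _Node:
    __slots__ = ("children", "value")

    def __init__(self) -> None:
        self.children = [None, None]  # child for bit 0 and bit 1
        self.value = None             # set if a stored prefix ends here


class LpmTrie:
    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, cidr: str, value) -> None:
        """Store value under a prefix such as "10.1.0.0/16"."""
        net = ipaddress.IPv4Network(cidr)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):  # one trie level per prefix bit
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = _Node()
            node = node.children[bit]
        node.value = value

    def lookup(self, addr: str):
        """Return the value of the longest stored prefix containing addr, or None."""
        bits = int(ipaddress.IPv4Address(addr))
        node = self.root
        best = node.value  # only set if a /0 prefix was inserted
        for i in range(32):
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break  # no longer prefix can match
            if node.value is not None:
                best = node.value  # most specific match so far
        return best


trie = LpmTrie()
trie.insert("10.0.0.0/8", "corp")  # made-up prefixes and values
trie.insert("10.1.0.0/16", "lab")
trie.insert("10.1.2.0/24", "rack-2")

print(trie.lookup("10.1.2.3"))     # rack-2
print(trie.lookup("10.1.9.9"))     # lab
print(trie.lookup("10.200.0.1"))   # corp
print(trie.lookup("192.168.0.1"))  # None
```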

Non-generic maps list:

BPF_MAP_TYPE_PROG_ARRAY
BPF_MAP_TYPE_PERF_EVENT_ARRAY
BPF_MAP_TYPE_CGROUP_ARRAY
BPF_MAP_TYPE_STACK_TRACE
BPF_MAP_TYPE_ARRAY_OF_MAPS
BPF_MAP_TYPE_HASH_OF_MAPS

Notice that BPF_HISTOGRAM is not a kernel map type; it is a BCC macro built on top of a regular kernel map.

See this pull request, which implements BPF_HISTOGRAM.
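A histogram like this is commonly filled with power-of-two buckets (for example, latencies), so each event only increments one small counter. The sketch below shows that bucketing idea in Python; log2_bucket and bucket_range are hypothetical helpers, and int.bit_length() may differ at the edges (such as 0 and 1) from BCC's own log2 helper, so treat it as an illustration of the concept only. The sample values are made up.

```python
# Power-of-two histogram bucketing: bucket k (k >= 1) holds values in [2**(k-1), 2**k - 1].
from collections import Counter


def log2_bucket(value: int) -> int:
    """Bucket index for a non-negative integer; 0 goes to bucket 0."""
    return value.bit_length()


def bucket_range(k: int) -> tuple:
    """Inclusive value range covered by bucket k."""
    if k == 0:
        return (0, 0)
    return (1 << (k - 1), (1 << k) - 1)


samples = [0, 1, 2, 3, 5, 7, 8, 100, 1000]  # made-up measurements
hist = Counter(log2_bucket(v) for v in samples)
for k in sorted(hist):
    low, high = bucket_range(k)
    print(f"{low:>6} -> {high:<6} : {hist[k]}")
```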

Third example

Source code

C part

BPF_PERF_OUTPUT(output);
struct data_t {
    int pid;
    int uid;
    char command[16];
    char message[12];
};
int hello(void *ctx) {
    struct data_t data = {};
    char message[12] = "Hello World";
    data.pid = bpf_get_current_pid_tgid() >> 32;
    data.uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
    bpf_get_current_comm(&data.command, sizeof(data.command));
    bpf_probe_read_kernel(&data.message, sizeof(data.message), message);
    output.perf_submit(ctx, &data, sizeof(data));
    return 0;
}

Python part

def print_event(cpu, data, size):
    data = b["output"].event(data)
    print(f"{data.pid} {data.uid} {data.command.decode()} " + \
    f"{data.message.decode()}")

b["output"].open_perf_buffer(print_event)
while True:
    b.perf_buffer_poll()

Explanation

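BPF_PERF_OUTPUT(): creates a map for sending events to userland through perf ring buffers. This is better than bpf_trace_printk(), because each program gets its own channel and can send structured data instead of a formatted string.

bpf_get_current_pid_tgid() >> 32: the upper 32 bits hold the thread group ID, which is what userland calls the process ID.

bpf_get_current_comm(): copies the command name of the current task.

perf_submit(): submits the data to the map, where userland can read it.

open_perf_buffer() and perf_buffer_poll(): register the print_event callback and poll for new events, calling the callback for each one.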
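In print_event, b["output"].event(data) turns the raw bytes of each event back into a Python object whose fields mirror struct data_t. The sketch below does the same decoding by hand with ctypes, to show what that conversion amounts to; the DataT class and the fake event bytes are assumptions for illustration, and it assumes the usual layout with no padding (4 + 4 + 16 + 12 = 36 bytes).

```python
# Decode a raw event laid out like the C struct data_t, using only ctypes.
import ctypes
import struct


class DataT(ctypes.Structure):
    # Field order and sizes must match struct data_t in the C part exactly
    _fields_ = [
        ("pid", ctypes.c_int),
        ("uid", ctypes.c_int),
        ("command", ctypes.c_char * 16),
        ("message", ctypes.c_char * 12),
    ]


# Fake event bytes in native byte order, with NUL padding like the C arrays
raw = struct.pack("=ii16s12s", 4242, 1000, b"bash", b"Hello World")

event = DataT.from_buffer_copy(raw)
# c_char array fields come back as bytes, cut at the first NUL byte
print(event.pid, event.uid, event.command.decode(), event.message.decode())
```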

Notes

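A newer and generally better way to output is BPF_RINGBUF_OUTPUT, which uses the BPF ring buffer introduced in Linux 5.8.

See this doc

BPF_PERF_OUTPUT differs from BPF_PERF_ARRAY; they are not the same thing. BPF_PERF_OUTPUT sends data to userland, while BPF_PERF_ARRAY is used to read hardware performance counters.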

At this point, we have some fundamental knowledge of eBPF, specifically of BCC.

A step further: defining multiple functions

Source code

Variant 1

from bcc import BPF
prog = """
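static void world() {
    bpf_trace_printk("world");
}
int hello(void* ctx) {
    bpf_trace_printk("hello");
    world();   // an ordinary function call within the same eBPF program
    return 0;  // hello is declared to return int, so it must return a value
}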
"""

b = BPF(text=prog)
syscall = b.get_syscall_fnname("execve")
b.attach_kprobe(event=syscall, fn_name="hello")
b.trace_print()

Variant 2

from bcc import BPF
import ctypes
prog = """
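BPF_PROG_ARRAY(funcs, 300);  // program array with 300 slots, used for tail calls
int world(void* ctx) {
    bpf_trace_printk("world");
    return 0;
}
int hello(void* ctx) {
    bpf_trace_printk("hello");
    funcs.call(ctx, 1);  // tail call into the program in slot 1; does not return on success
    return 0;            // only reached if the tail call fails (e.g. the slot is empty)
}
"""

b = BPF(text=prog)
syscall = b.get_syscall_fnname("execve")
b.attach_kprobe(event=syscall, fn_name="hello")
# Load world() as a separate kprobe program and store its file descriptor in slot 1
world_fn = b.load_func("world", BPF.KPROBE)
prog_array = b.get_table("funcs")
prog_array[ctypes.c_int(1)] = ctypes.c_int(world_fn.fd)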
b.trace_print()
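funcs.call(ctx, 1) is a tail call: if slot 1 of the program array holds a program, control jumps to it and never comes back to hello; if the slot is empty, the call fails and hello simply continues. The Python model below imitates those semantics with a dict standing in for the BPF_PROG_ARRAY; it is a conceptual sketch only, not how the kernel implements tail calls, and tail_call and run_program are hypothetical names.

```python
# Conceptual model of eBPF tail calls (not the kernel's mechanism).
# A successful tail call replaces the caller: code after it never runs.
# A failed one (empty slot) returns, and the caller carries on.


class _TailCallTaken(Exception):
    """Raised to unwind the caller when a tail call succeeds."""

    def __init__(self, result):
        super().__init__()
        self.result = result


def tail_call(prog_array: dict, ctx, index: int) -> None:
    target = prog_array.get(index)
    if target is None:
        return  # empty slot: the caller keeps running
    raise _TailCallTaken(target(ctx))


def run_program(prog, ctx):
    """Entry point, like the kernel invoking the program attached to a hook."""
    try:
        return prog(ctx)
    except _TailCallTaken as taken:
        return taken.result


log = []
funcs = {}  # stands in for BPF_PROG_ARRAY(funcs, 300)


def world(ctx):
    log.append("world")
    return 0


def hello(ctx):
    log.append("hello")
    tail_call(funcs, ctx, 1)
    log.append("tail call failed, still in hello")
    return 0


run_program(hello, ctx=None)  # slot 1 is still empty
print(log)  # ['hello', 'tail call failed, still in hello']

log.clear()
funcs[1] = world  # like prog_array[ctypes.c_int(1)] = ctypes.c_int(world_fn.fd)
run_program(hello, ctx=None)
print(log)  # ['hello', 'world']
```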

Except where otherwise noted, this site is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License