How System Calls Work
Why should you care about syscalls?
As a web developer, learning about syscalls and the infrastructure around them
can make you feel quite a bit more confident in debugging and reasoning about
how systems will perform. Ruby and C++ both have their own idiomatic ways of
opening files, but in the end they both end up using the same syscall, open.
This is because userland processes (like web applications) have only one way of
communicating with the operating system: syscalls.
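To make that concrete, here is a minimal sketch of opening a file without going through a language's file API. It assumes Linux and a libc that provides syscall(2); the helper name raw_open_readonly is made up for this example. It uses the openat syscall rather than open, because openat exists on every Linux architecture and is what current libc versions typically use under the hood.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* O_RDONLY, AT_FDCWD */
#include <sys/syscall.h>  /* SYS_openat */
#include <unistd.h>       /* syscall */

/* Hypothetical helper: open a file read-only by syscall number instead of
 * calling the libc open() wrapper.
 * Returns a file descriptor, or -1 with errno set on failure. */
int raw_open_readonly(const char *path)
{
    /* AT_FDCWD makes a relative path resolve against the current
     * directory, which matches what plain open() would do. */
    return (int)syscall(SYS_openat, AT_FDCWD, path, O_RDONLY);
}
```

Calling raw_open_readonly("/etc/hostname") should hand back the same kind of file descriptor that Ruby's File.open or a C++ ifstream ends up holding internally.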
What to except when you’re excepting
In order for a process to communicate with the kernel, it has to pass execution to it somehow along with a number of arguments. It does that by issuing an exception, which moves the control flow from your process to the kernel’s interrupt handler, which processes the arguments and selects the correct syscall.
An exception is just one name for this concept - but there are a lot of names for the same thing: “different manufacturers have used terms like exceptions, faults, aborts, traps, and interrupts.”[1]
In order to better understand this, let’s take a look at a very simple syscall
in x86 assembly: getpid, which returns the id of the calling process. Its
syscall number is 20, so we put that into the eax CPU register, since that’s
where the kernel will look to determine which syscall to call.
    mov eax, 20
    int 0x80
The int instruction above triggers a software interrupt or exception, which
causes the CPU to suspend your process and run the kernel’s interrupt handler.
The handler sees that the interrupt vector we specified was 128 (0x80), which
corresponds to the syscall interrupt vector. The kernel then looks in the eax
register and checks whether it can find that number in its syscall table. If
found, it calls that syscall.
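The same request can be sketched from C using the syscall(2) wrapper, which loads the number and traps into the kernel for you. This assumes Linux; the helper name raw_getpid is made up for the example. Note that SYS_getpid expands to whatever number the target architecture uses: 20 on 32-bit x86 as above, but 39 on x86-64, where the entry mechanism is also different from int 0x80.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_getpid */
#include <sys/types.h>    /* pid_t */
#include <unistd.h>       /* syscall, getpid */

/* Hypothetical helper: ask the kernel for our process id by syscall
 * number, skipping the libc getpid() wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

raw_getpid() and the regular getpid() should always agree, since both end up at the same entry in the kernel's table.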
Let’s take a look at exactly where that takes you inside the Linux kernel, annotated with (my) comments:
    sysenter_do_call:
        ; cmpl - compare (subtracts and sets flags, discarding the result)
        ; Compare the syscall number (%eax) against the total number of syscalls
        cmpl $(NR_syscalls), %eax

        ; jae - jump if Above or Equal (unsigned comparison)
        ; If the syscall number was out of range, handle bad call
        jae sysenter_badsys

        ; call - call a subroutine
        ; *sys_call_table(,%eax,4)
        ;   - The * makes this an indirect call through the pointer stored there
        ;   - (,%eax,4) indexes the table: %eax times 4 bytes per entry
        ; Call the syscall you wanted
        call *sys_call_table(,%eax,4)
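The compare, jump, and indirect call sequence above is an ordinary bounds-checked lookup in an array of function pointers. Here is a toy userspace model of it in C. It is not kernel code: the table, the two handlers, and every name starting with toy_ are invented for illustration.

```c
#include <errno.h>   /* ENOSYS */
#include <stddef.h>  /* size_t */

/* Every entry shares one signature so they can all live in one table. */
typedef long (*toy_syscall_fn)(long arg);

/* Two made-up "syscalls". */
static long toy_double(long arg) { return arg * 2; }
static long toy_negate(long arg) { return -arg; }

/* The table is indexed by number only; names exist just in the source. */
static const toy_syscall_fn toy_call_table[] = {
    toy_double,  /* toy syscall 0 */
    toy_negate,  /* toy syscall 1 */
};

static const size_t toy_nr_syscalls =
    sizeof toy_call_table / sizeof toy_call_table[0];

/* Mirrors the three steps of the assembly: compare, bail out, call. */
long toy_dispatch(unsigned long nr, long arg)
{
    if (nr >= toy_nr_syscalls)       /* cmpl + jae: unsigned range check */
        return -ENOSYS;              /* the "bad syscall" path */
    return toy_call_table[nr](arg);  /* call through the table entry */
}
```

Because the comparison is unsigned, a negative number reinterpreted as unsigned is far larger than the table size and gets rejected by the same single check, which is why jae is enough on its own.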
As we saw before, the syscall number goes in register eax. The Linux kernel
knows nothing about syscall names. All it knows is their numbers, and this is
where it looks up the syscall’s function pointer and calls it. Here are some
examples of syscalls you might recognize and their numbers:
- 5: open(2) - open a file
- 12: chdir(2) - your good friend, cd
- 34: nice(2) - change a process’s nice value
Here’s a full table of syscalls and their arguments.
Once a syscall number is decided, it is never changed. As you can imagine, doing so would break every program that was built against the old number.
Aside: when you see a syscall written like this: open(2), the (2) refers to the
man page section for syscalls, which is section 2.
Passing arguments to syscalls
Ok, so a syscall is just a function in the kernel you call in a special interrupt-y way. How do you pass it arguments?
We saw that you put the syscall number in register eax. The kernel looks for
arguments in registers ebx, ecx, and edx. Let’s take a look at a hello world
program using the syscalls write and exit:
    global _start

    section .text
    _start:
        mov eax, 4        ; write
        mov ebx, 1        ; stdout
        mov ecx, msg
        mov edx, msg.len
        int 0x80          ; write(stdout, msg, strlen(msg));

        mov eax, 1        ; exit
        mov ebx, 0
        int 0x80          ; exit(0)

    section .data
    msg:    db "Hello, world!", 10
    .len:   equ $ - msg
The first argument (in ebx) is a file descriptor - in this case 1, which is
stdout. The second argument (ecx) is a pointer to the start of the message, and
the third (edx) is the message’s length. exit takes one argument, the exit
code - which was 0.
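For comparison, here is a sketch of the same write call made from C through syscall(2), which places the three arguments wherever the current architecture expects them. It assumes Linux; raw_write and say_hello are made-up helper names.

```c
#define _GNU_SOURCE
#include <stddef.h>       /* size_t */
#include <sys/syscall.h>  /* SYS_write */
#include <sys/types.h>    /* ssize_t */
#include <unistd.h>       /* syscall */

/* Hypothetical helper: the same three arguments as the assembly version
 * (file descriptor, pointer to the bytes, number of bytes).
 * Returns the number of bytes written, or -1 with errno set. */
ssize_t raw_write(int fd, const void *buf, size_t len)
{
    return (ssize_t)syscall(SYS_write, fd, buf, len);
}

/* Writes the same message as the assembly program to stdout (fd 1). */
ssize_t say_hello(void)
{
    static const char msg[] = "Hello, world!\n";
    return raw_write(1, msg, sizeof msg - 1);  /* skip the trailing NUL */
}
```

The kernel never looks for a terminating NUL byte here. It writes exactly as many bytes as the length argument says, which is why the assembly computes msg.len and the C version passes sizeof msg - 1.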
If the syscall you’re using takes a lot of arguments, instead of putting values in the registers, you’ll put pointers to data structures you own in userspace.
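A closely related case that is easy to try is uname, where the single argument is a pointer to a structure in your own memory that the kernel fills in. The sketch below assumes Linux and a libc whose struct utsname matches the kernel's layout (glibc and musl do); raw_uname is a made-up helper name. It shows the pointer-passing mechanism rather than a syscall with many arguments.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_uname */
#include <sys/utsname.h>  /* struct utsname */
#include <unistd.h>       /* syscall */

/* Hypothetical helper: the register holds only the address of `out`.
 * The structure itself stays in userspace and the kernel copies the
 * system name, release, and so on into it before returning.
 * Returns 0 on success, or -1 with errno set. */
int raw_uname(struct utsname *out)
{
    return (int)syscall(SYS_uname, out);
}
```

After a successful call, fields such as out->sysname and out->release hold NUL-terminated strings written by the kernel.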
If you want to learn more about syscalls, please consult these fine sources of good syscall information:
- tldp.org - How System Calls Work on Linux/i86
- Intel 80x86 Assembly Language OpCodes (mathemainzel.info)
- x86 Assembly Guide (cs.virginia.edu)
- Say hello to x64 Assembly (0xax.blogspot.com)
If you want to see syscalls in action, try using the strace command on Linux.
There’s a fantastic resource on it by Julia Evans.
[1] Interrupts, Traps, and Exceptions: flint.cs.yale.edu