How System Calls Work

Why should you care about syscalls?

As a web developer, learning about syscalls and the infrastructure around them can make you feel quite a bit more confident in debugging and reasoning about how systems will perform. Ruby and C++ both have their own idiomatic ways of opening files, but in the end they both end up using the syscall open(). This is because userland processes (like web applications) have only one way of communicating with the operating system: syscalls.
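To make that concrete in one more language, here is a rough sketch in Python (my choice for illustration; the post itself only mentions Ruby and C++). It reads the same file twice: once through the idiomatic built-in open(), and once through os.open/os.read/os.close, which are thin wrappers around the kernel's file syscalls. The helper names and the scratch file are made up for this example, and on current Linux systems the C library usually issues openat() rather than open() under the hood.

```python
import os
import tempfile


def read_with_builtin(path):
    # Idiomatic route: the built-in open() plus a context manager.
    with open(path, "rb") as f:
        return f.read()


def read_with_os(path):
    # Low-level route: integer file descriptors, one call per syscall.
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 4096)
            if not chunk:  # an empty read means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def demo():
    # Create a scratch file so the example is self-contained.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello from userland\n")
        path = tmp.name
    try:
        return read_with_builtin(path), read_with_os(path)
    finally:
        os.unlink(path)


high_level, low_level = demo()
print(high_level == low_level)
```

Both routes return the same bytes because both end up asking the kernel to do the actual work.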

What to except when you’re excepting

In order for a process to communicate with the kernel, it has to pass execution to it somehow along with a number of arguments. It does that by issuing an exception, which moves the control flow from your process to the kernel’s interrupt handler, which processes the arguments and selects the correct syscall.

An exception is just one name for this concept - but there are a lot of names for the same thing: “different manufacturers have used terms like exceptions, faults, aborts, traps, and interrupts.”1

In order to better understand this, let’s take a look at a very simple syscall in x86 assembly: getpid, which returns the id of the calling process. On 32-bit x86 Linux its syscall number is 20, so we put that into the eax cpu register since that’s where the kernel will look to determine which syscall to call.

mov eax, 20
int 0x80

The int instruction above triggers a software interrupt, or exception, which makes the CPU stop executing your process and jump to the kernel’s interrupt handler. The handler sees that the interrupt vector we specified was 0x80, or 128, which corresponds to the syscall interrupt vector. The kernel then looks in the eax register and checks whether that number is in its syscall table. If it is, the kernel calls that syscall.
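If you’d like to poke at this without writing assembly, here is a small sketch that calls getpid by number through the C library’s generic syscall() function, using Python’s ctypes. It assumes Linux. Note that the number 20 only applies to 32-bit x86: 64-bit x86 uses 39 and 64-bit ARM uses 172. The helper names are mine, and the function returns None on platforms the sketch doesn’t know about.

```python
import ctypes
import os
import platform
import sys


def getpid_number():
    """Return getpid's Linux syscall number for this process, or None."""
    machine = platform.machine()
    is_64bit = ctypes.sizeof(ctypes.c_void_p) == 8
    if machine in ("i386", "i686") or (machine == "x86_64" and not is_64bit):
        return 20   # 32-bit x86, the table used in the text
    if machine == "x86_64":
        return 39   # 64-bit x86 has its own numbering
    if machine == "aarch64" and is_64bit:
        return 172  # 64-bit ARM
    return None


def raw_getpid():
    """Call getpid by number via libc's syscall(); None if unsupported."""
    number = getpid_number()
    if not sys.platform.startswith("linux") or number is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(number))


pid = raw_getpid()
print("raw syscall:", pid, "os.getpid():", os.getpid())
```

On a supported machine the two numbers printed should match, since os.getpid() ends up at the same kernel function.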

Let’s take a look at exactly where that takes you inside the Linux kernel, annotated with (my) comments:

sysenter_do_call:
  ; cmpl - compare (subtracts and sets the flags, but discards the result)
  ; Compare the syscall number (%eax) against the total number of syscalls
  cmpl $(NR_syscalls), %eax

  ; jae - jump if Above or Equal (unsigned comparison)
  ; If the syscall number is >= NR_syscalls it is out of range: handle bad call
  jae sysenter_badsys

  ; call - call a subroutine
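  ; *sys_call_table(,%eax,4)
  ;   - The * makes this an indirect call: call the address stored at that location
  ;   - sys_call_table(,%eax,4) is the address sys_call_table + %eax * 4,
  ;     since each table entry is a 4-byte function pointer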
  ; Call the syscall you wanted
  call *sys_call_table(,%eax,4)

As we saw before, the syscall number goes in register eax. The Linux kernel knows nothing about syscall names. All it knows is their numbers, and this is where it looks up the syscall’s function pointer and calls it. Here are a few syscalls you might recognize and their numbers:

Here’s a full table of syscalls and their arguments.

Once a syscall number is decided, it is never changed. As you can imagine, doing so would break every already-compiled program that uses that number.
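The lookup in the assembly above can be modeled in a few lines of Python. This is only a toy model, not kernel code: the table is tiny, the sys_* functions are stand-ins I made up, and the only numbers filled in are the 32-bit x86 ones that appear in this post (1 for exit, 4 for write, 20 for getpid). As far as I know, Linux reports a bad syscall number as -ENOSYS, so the model does the same.

```python
import errno
import os


def sys_exit(status):
    return f"exit({status})"  # stand-in: a real exit never returns


def sys_write(fd, data):
    return len(data)          # stand-in: pretend every byte was written


def sys_getpid():
    return os.getpid()


def sys_ni_syscall(*_args):
    return -errno.ENOSYS      # placeholder for unimplemented slots


NR_SYSCALLS = 21
SYS_CALL_TABLE = [sys_ni_syscall] * NR_SYSCALLS
SYS_CALL_TABLE[1] = sys_exit
SYS_CALL_TABLE[4] = sys_write
SYS_CALL_TABLE[20] = sys_getpid


def dispatch(eax, *args):
    # cmpl $(NR_syscalls), %eax / jae sysenter_badsys
    # The real comparison is unsigned, so negative numbers are rejected too.
    if not 0 <= eax < NR_SYSCALLS:
        return -errno.ENOSYS
    # call *sys_call_table(,%eax,4)
    return SYS_CALL_TABLE[eax](*args)


print(dispatch(20) == os.getpid())
print(dispatch(4, 1, b"Hello, world!\n"))
print(dispatch(99))
```

Notice that the names never appear in dispatch(): everything is driven by the number, which is why renumbering a syscall would silently route old programs to the wrong function.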

Aside: when you see syscalls written like this: open(2), execve(2), the 2 is referring to the man page section for syscalls, which is section 2.

Passing arguments to syscalls

Ok, so a syscall is just a function in the kernel you call in a special interrupt-y way. How do you pass it arguments?

We saw that you put the syscall number in register eax. The kernel looks for the first three arguments in registers ebx, ecx, and edx (any further ones go in esi, edi, and ebp). Let’s take a look at a hello world program using the syscalls write() and exit().

global _start

section .text
_start:
  mov eax, 4 ; write
  mov ebx, 1 ; stdout
  mov ecx, msg
  mov edx, msg.len
  int 0x80   ; write(stdout, msg, strlen(msg));

  mov eax, 1 ; exit
  mov ebx, 0
  int 0x80   ; exit(0)

section .data
msg:  db  "Hello, world!", 10
.len: equ $ - msg

The first argument (in ebx) is a file descriptor - in this case stdout. The second argument (ecx) is a pointer to the start of the message, and the third (edx) is the message’s length.

exit takes one argument, the exit code - which was 0.
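For comparison, here is roughly the same pair of calls from Python, as a sketch rather than a translation. os.write(fd, data) takes the same three pieces of information as the assembly: the descriptor, the buffer, and (implicitly) its length. I write into a pipe instead of stdout so the result can be read back and checked; passing 1 as the descriptor would print to the terminal instead. os._exit() is the closest thing to the bare exit syscall, so it runs in a child process here to keep the script alive. The helper names are mine.

```python
import os
import subprocess
import sys

MSG = b"Hello, world!\n"


def write_message(fd):
    # Mirrors write(fd, msg, len): fd is what went in ebx, the bytes object
    # is what ecx pointed at, and its length is what went in edx.
    return os.write(fd, MSG)


def exit_status_of_child(code):
    # os._exit() ends the process immediately with the given status.
    # Run it in a child interpreter and report the status the parent sees.
    child = subprocess.run(
        [sys.executable, "-c", f"import os; os._exit({code})"]
    )
    return child.returncode


read_end, write_end = os.pipe()
written = write_message(write_end)  # number of bytes the kernel accepted
os.close(write_end)
echoed = os.read(read_end, 64)
os.close(read_end)

print(written, echoed)
print(exit_status_of_child(0))
```

The return value of os.write is the byte count reported by the kernel, which is the same value the assembly version would find in eax after the interrupt returns.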

If the syscall you’re using needs more data than fits in the argument registers, then instead of putting the values themselves in the registers, you’ll put pointers to data structures you own in userspace.
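uname is a handy example of the pointer style: you hand the kernel the address of a struct you own, and it fills the struct in. The sketch below does that through ctypes. It assumes Linux, where (to my knowledge) struct utsname is six fixed-size fields of 65 bytes each; other systems lay it out differently, so the function returns None anywhere else. The class and function names are my own.

```python
import ctypes
import os
import sys

FIELD_LEN = 65  # assumption: field size of struct utsname on Linux


class Utsname(ctypes.Structure):
    # Userspace-owned buffer that the kernel writes into.
    _fields_ = [
        ("sysname", ctypes.c_char * FIELD_LEN),
        ("nodename", ctypes.c_char * FIELD_LEN),
        ("release", ctypes.c_char * FIELD_LEN),
        ("version", ctypes.c_char * FIELD_LEN),
        ("machine", ctypes.c_char * FIELD_LEN),
        ("domainname", ctypes.c_char * FIELD_LEN),
    ]


def kernel_name():
    """Return the kernel's name via uname(), or None when not on Linux."""
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.uname.argtypes = [ctypes.POINTER(Utsname)]
    libc.uname.restype = ctypes.c_int
    buf = Utsname()
    # Only the address of buf crosses into the kernel, not its contents.
    if libc.uname(ctypes.byref(buf)) != 0:
        raise OSError(ctypes.get_errno(), "uname failed")
    return buf.sysname.decode()


name = kernel_name()
print(name)
```

The single argument is just an address; everything else travels through memory that your process allocated.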

Done

If you want to learn more about syscalls, please consult these fine sources of good syscall information:

If you want to see syscalls in action, try using the strace command on Linux. There’s a fantastic writeup on it by Julia Evans.


  1. Interrupts, Traps, and Exceptions: flint.cs.yale.edu
