Ramdisk Versus Ramfs - Memory usage issues
A copy of this paper can be obtained in PDF format here.
The question of how RAM is used on RAM-only systems arose when Tin Hat was first designed. The issue came up again and I decided to test three ways of setting up a RAM-only system to lay the question to rest:
1) The traditional initial ramdisk image, initrd. Here one creates an ext2 filesystem in a file, mounts it via a loopback device and populates it. The filesystem is then unmounted and the file gzipped. On boot, the bootloader loads the image and the kernel uses it as the initial root filesystem. The kernel must be configured with CONFIG_BLK_DEV_INITRD=y and with CONFIG_BLK_DEV_RAM_SIZE bigger than the size of the ext2 filesystem. When the system boots, "df" reports that the root filesystem is on /dev/ram0 (the ramdisk) with its size fixed to that of the ext2 filesystem you created. It cannot be resized, which fixes the division between RAM set aside for the filesystem and RAM used for processes. "free" reports used RAM = the fixed RAM set aside for the ramdisk plus the RAM used for processes.
2) The newer initial ramfs image, initramfs. Here one populates a directory and then creates a compressed cpio archive, which is expanded into ramfs upon boot and becomes the root filesystem. The kernel must be configured with CONFIG_BLK_DEV_INITRD=y, but one needs neither CONFIG_BLK_DEV_RAM_SIZE nor CONFIG_TMPFS=y. When the system is up, "df" does not report the root filesystem, and one cannot interact with it by doing things like "mount --bind / dir". Also, the distinction between RAM set aside for the filesystem and RAM used for processes is blurred: "free" reports total usage without distinction, i.e., used RAM = RAM used for files (as reported by "du") plus RAM used for processes.
3) Bootstrapping into tmpfs. In this case, one uses either an initrd or an initramfs image to get into a small RAM-only environment that sets up a tmpfs filesystem, unpacks the new root filesystem into it from some image (such as a squashfs on the boot device), and finally does a switch_root to tmpfs. Two images are needed: the initial initrd/initramfs and a second image which is unpacked into tmpfs. Here the kernel must be configured for initrd/initramfs and in addition needs CONFIG_TMPFS=y. When the system is up, "df" reports the root filesystem mounted as tmpfs at the size set by the initrd/initramfs when it was mounted, and "free" reports the total RAM usage without distinction, as for an initramfs above. Unlike #2 above, one can interact with the root filesystem: for example, one can set (or reset) a limit on how much RAM is set aside for the root filesystem by doing "mount -o remount,size=512m /", and one can do "mount --bind / dir".
Note: It is possible to configure the kernel to support an initramfs (CONFIG_BLK_DEV_INITRD=y) but not support tmpfs (CONFIG_TMPFS=n).
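The kernel options named for the three setups can be checked mechanically. What follows is only a sketch in Python, not part of the original test suite: the function names are made up for illustration, the sample config is hypothetical apart from CONFIG_BLK_DEV_RAM_SIZE=32768, and treating a ramdisk exactly as large as the ext2 image as "big enough" is an assumption.

```python
import gzip


def parse_kernel_config(text):
    """Map each set option (CONFIG_X=value) to its value; unset options are absent."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def supported_setups(options, ext2_image_kib):
    """Apply the requirements listed above; ext2_image_kib is the initrd's ext2 size."""
    initrd = options.get("CONFIG_BLK_DEV_INITRD") == "y"
    ramdisk_kib = int(options.get("CONFIG_BLK_DEV_RAM_SIZE", "0"))
    return {
        # Assumption: a ramdisk exactly as large as the image counts as big enough.
        "initrd": initrd and ramdisk_kib >= ext2_image_kib,
        "initramfs": initrd,
        "tmpfs": initrd and options.get("CONFIG_TMPFS") == "y",
    }


def load_running_config(path="/proc/config.gz"):
    """Read the running kernel's config; needs a kernel that exposes /proc/config.gz."""
    with gzip.open(path, "rt") as handle:
        return parse_kernel_config(handle.read())


# Hypothetical config: initramfs support without tmpfs, as in the note above.
sample = """
CONFIG_BLK_DEV_INITRD=y
CONFIG_BLK_DEV_RAM_SIZE=32768
# CONFIG_TMPFS is not set
"""
print(supported_setups(parse_kernel_config(sample), ext2_image_kib=32768))
```

On a running system one would pass load_running_config() instead of the sample text.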
I wanted to test these with respect to three aspects of memory usage by processes: 1) memory allocation on the heap, 2) memory allocation on the stack, and 3) memory needed for text. I didn't expect any difference between stack and heap allocation, but I particularly wanted to test whether ramdisk and ramfs do "execution in place", i.e., whether they avoid copying the process's text from the filesystem to page memory, as is done when the filesystem is on a "real" block device, e.g., a hard drive. To check this I wrote three programs:
1) heap.c which forks into the background and then uses glibc's malloc to request memory from the heap. It sits in a tight loop until killed.
E.g., one executes
~ # heap 4000
to request 4000 pages (4K each) of RAM. Here's what pmap gives:
~ # pmap 317
317: heap 4000
08048000 4K r-x-- /bin/heap
08049000 4K r---- /bin/heap
0804a000 4K rw--- /bin/heap
b6e93000 16012K rw--- [ anon ] <-------- 16000K = 4000 pages
b7e36000 1232K r-x-- /lib/libc-2.8.so
b7f6a000 8K r---- /lib/libc-2.8.so
b7f6c000 4K rw--- /lib/libc-2.8.so
b7f6d000 16K rw--- [ anon ]
b7f71000 4K r-x-- [ anon ]
b7f72000 108K r-x-- /lib/ld-2.8.so
b7f8d000 4K r---- /lib/ld-2.8.so
b7f8e000 4K rw--- /lib/ld-2.8.so
bfc79000 84K rw--- [ stack ]
2) stack.c, which works similarly except that the allocation is done on the stack. It links against glibc. One caveat: even though we set ulimit -s 0, it's easy to get a stack overflow. One can run multiple instances of stack to repeatedly allocate stack memory. E.g., running stack 1000 and then pmap gives:
~ # stack 1000
~ # pmap 336
336: stack 1000
08048000 4K r-x-- /bin/stack
08049000 4K r---- /bin/stack
0804a000 4K rw--- /bin/stack
b7df8000 8K rw--- [ anon ]
b7dfa000 1232K r-x-- /lib/libc-2.8.so
b7f2e000 8K r---- /lib/libc-2.8.so
b7f30000 4K rw--- /lib/libc-2.8.so
b7f31000 16K rw--- [ anon ]
b7f35000 4K r-x-- [ anon ]
b7f36000 108K r-x-- /lib/ld-2.8.so
b7f51000 4K r---- /lib/ld-2.8.so
b7f52000 4K rw--- /lib/ld-2.8.so
bf767000 4012K rw--- [ stack ] <-------- 4000K = 1000 pages
3) mktext.c creates text.asm, which is then assembled to text. This binary does NOT link against anything; rather, it uses registers ebx, ecx and edx to calculate the Fibonacci sequence without allocating any memory on the stack or heap. It is just 1000 pages (i.e., 4MB) of
mov edx, ebx
add edx, ecx
mov ebx, ecx
mov ecx, edx
Crazy no? The point is, when this is run, does the kernel copy the 4MB of text to page memory, or does it execute in place? pmap gives
~ # pmap 310
08048000 4100K r-x-- /bin/text <-------- 4000K = 1000 pages
b7fc8000 4K r-x-- [ anon ]
bfdb3000 84K rwx-- [ stack ]
text forks into the background when run, so we need only run a bunch of these to see what happens to our RAM usage.
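To make the idea behind mktext concrete, here is a sketch of such a generator. It is written in Python rather than C and is not the author's mktext.c: the NASM-style directives, the seed values, the "fib" label and the closing jump are all assumptions, and the fork into the background is left out. Each of the four register-to-register instructions encodes to 2 bytes on 32-bit x86, so one block is 8 bytes and 512 blocks fill a 4K page.

```python
PAGE_SIZE = 4096  # bytes per page on 32-bit x86
FIB_BLOCK = ("mov edx, ebx", "add edx, ecx", "mov ebx, ecx", "mov ecx, edx")
BLOCK_BYTES = 8   # four register-to-register instructions, 2 bytes each


def make_text_asm(pages):
    """Return assembly source whose repeated Fibonacci block fills `pages` pages."""
    repeats = pages * PAGE_SIZE // BLOCK_BYTES
    lines = [
        "global _start",
        "section .text",
        "_start:",
        "    mov ebx, 0    ; assumed seed values, not taken from the original",
        "    mov ecx, 1",
        "fib:",
    ]
    block = ["    " + instruction for instruction in FIB_BLOCK]
    lines.extend(block * repeats)
    lines.append("    jmp fib    ; assumed: spin until killed")
    return "\n".join(lines) + "\n"


# Small demonstration: one page worth of instructions.
one_page = make_text_asm(1)
print(one_page.count("add edx, ecx"), "blocks in one page")
```

Calling make_text_asm(1000) and writing the result to text.asm would give roughly the 4000K of text seen in the pmap output above.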
Building the systems
I have prebuilt ISOs, but if you want to rebuild them from scratch, all the necessary goodies can be found here. The README describes what scripts to run to build the ISOs. I even put the kernel images, binaries and libraries in the directory. The code to build heap, stack and text is in the "tests" directory. I grabbed busybox, pmap and the libraries off a vanilla working Gentoo system (glibc-2.8_p20080602-r1). To get busybox's configuration, just run "busybox bbconfig". To get the kernel's config, you'll find it in /proc/config.gz. I used vanilla linux-2.6.28 patched to 2.6.28.5.
I configured these kernels for the hardware of a VMWare or qemu emulator. The ISOs expect the boot device to be a cdrom at /dev/hda. If you want to use qemu, you'll have to switch to hdc in the build-xxx.sh scripts and rebuild the ISOs. I also set CONFIG_BLK_DEV_RAM_SIZE=32768 (the value is in K), and used 64MB of physical RAM in the emulator. Finally, the only difference between bzImage and bzImage.ne2e is that the former has ext2 execution in place support enabled, whereas the latter does not. In all other ways, they are identical kernels.
1. Ramdisk system: The ISO is project-initrd.iso. The raw results of the test can be found below. Here's a summary of points to note:
- In all of the tests, you can see that "df" shows 31729K of RAM immediately tied up in the ramdisk image. This is a fixed amount that cannot be changed, and it does not change throughout the tests. Since only 23% of the ramdisk is used, the remaining 77% of 32M, about 25M, will never be available for paging should it be needed.
- The initial value reported by "free" is 37500 +- 12K, i.e., about 5.8M above the 32M for the ramdisk. As we expected, it doesn't matter whether pages are allocated in the process's heap or stack: both the heap and the stack tests show RAM usage jumping by about 4000K with each 1000 page allocation. E.g., in the heap test, the RAM usage increases by 4116K, 4092K, 4088K, 4096K ... on the initial and subsequent allocations. Similarly, in the stack test, the RAM usage jumps by 4112K, 4092K, 4092K, 4080K ...
- The "text" test was more interesting. The first running instance of "text" forced an allocation of 4208K as reported by "free", while subsequent instances only forced an allocation of 28K, 32K, 28K, 32K .... Clearly the loader (the handler for execve) copied the text of "text" from the ramdisk to page memory upon the first execve, but then shared the text between running instances upon subsequent execve's. There was no execution in place, and RAM is wasted because two copies of the process's text are held in RAM: one in the ramdisk and the other in page memory. This is the same behavior as that of the "disked" system below. Surprisingly, enabling the kernel's "ext2 execution in place support" made no difference. I'm not sure why.
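The arithmetic behind the three points above can be checked in a few lines. This is just a worked example in Python using the figures quoted in the text; the variable names are mine, and nothing here is measured afresh.

```python
RAMDISK_DF_KIB = 31729        # ramdisk size as reported by "df"
RAMDISK_CONFIG_KIB = 32768    # CONFIG_BLK_DEV_RAM_SIZE, i.e. 32M
INITIAL_USED_KIB = 37500      # "free" right after boot
HEAP_STEPS_KIB = [4116, 4092, 4088, 4096]
STACK_STEPS_KIB = [4112, 4092, 4092, 4080]
TEXT_STEPS_KIB = [4208, 28, 32, 28, 32]


def mean(values):
    return sum(values) / len(values)


# 77% of the 32M ramdisk holds no files but can never be used for paging.
idle_ramdisk_mib = 0.77 * RAMDISK_CONFIG_KIB / 1024

# RAM in use beyond the ramdisk itself right after boot.
boot_overhead_kib = INITIAL_USED_KIB - RAMDISK_DF_KIB

# Cost of the first "text" instance over and above a later instance.
first_text_extra_kib = TEXT_STEPS_KIB[0] - mean(TEXT_STEPS_KIB[1:])

print("idle ramdisk space: about %.1f M" % idle_ramdisk_mib)
print("usage above the ramdisk at boot: %d K" % boot_overhead_kib)
print("mean heap step: %.0f K, mean stack step: %.0f K"
      % (mean(HEAP_STEPS_KIB), mean(STACK_STEPS_KIB)))
print("extra cost of the first text instance: %.0f K" % first_text_extra_kib)
```

The last figure is close to the 4100K text mapping that pmap shows for "text", which is what one would expect if the whole text was copied once.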
2. Initramfs system: The ISO is project-initramfs.iso. Here's a summary of points to note:
- Upon boot "df" reports no root filesystem, "df -a" reports only the pseudo-filesystems mounted by the rcS scripts, and "df -i" reports no i-nodes. There is no way to set or reset root filesystem's size. It is blurred in with the rest of the RAM and one can add files to it without hitting any limit until RAM is exhausted. Eg. "dd if=/dev/zero of=waste" bottoms out with an "out of memory" kill by the kernel, and any attempt to spawn further processes, like "heap 1000", results in the same.
- Upon boot "free" reports about 10M RAM usage, but we cannot say how much is filesystem and how much is page memory. "du" gives us a clue, as does comparing to the ramdisk test --- about 6.5M for files and 3.5M for processes.
- Allocation of pages in a process's heap or stack proceeds as expected. RAM usage jumped in steps of 8M with each "heap 2000".
- Again, the "text" test was more interesting. Unlike the ramdisk situation, in this case every new running instance of "text" only forced about 32K of allocation, even the very first instance! This means the loader does execution in place without copying the text from the root filesystem into page memory (ramfs keeps file contents in the page cache, so the text pages can be mapped directly). I was able to simultaneously run hundreds of instances without eating up much RAM. The load was a different story!
- It is not surprising that "ext2 exec in place" support is irrelevant here. The whole ramfs system sits higher up in the kernel's VFS subsystem. We tested it anyhow.
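Since "df" says nothing about the root filesystem here, "du" is the only handle on how much RAM the files occupy. The sketch below does something similar in Python: it adds up regular files under a directory, stays on one filesystem, and rounds each file up to whole 4K pages, on the assumption that ramfs holds file data in whole pages. The function name is made up, and the result is only an estimate that ignores sparse files and per-inode overhead.

```python
import os
import stat
import tempfile

PAGE_SIZE = 4096  # assumption: file data is held in whole 4K pages


def ramfs_footprint_kib(root="/"):
    """Estimate the RAM (in K) held by regular files under root, on root's filesystem."""
    root_dev = os.lstat(root).st_dev
    seen = set()
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into other filesystems such as /proc or /sys.
        dirnames[:] = [name for name in dirnames
                       if os.lstat(os.path.join(dirpath, name)).st_dev == root_dev]
        for name in filenames:
            try:
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # the file vanished while we were walking
            key = (info.st_dev, info.st_ino)
            if not stat.S_ISREG(info.st_mode) or key in seen:
                continue  # skip symlinks, devices, and repeated hard links
            seen.add(key)
            total_bytes += -(-info.st_size // PAGE_SIZE) * PAGE_SIZE
    return total_bytes // 1024


# Demonstration on a scratch directory rather than on "/".
with tempfile.TemporaryDirectory() as demo:
    with open(os.path.join(demo, "small"), "wb") as handle:
        handle.write(b"x")           # 1 byte occupies one page
    with open(os.path.join(demo, "large"), "wb") as handle:
        handle.write(b"x" * 4097)    # 4097 bytes occupy two pages
    print(ramfs_footprint_kib(demo), "K")
```

On the initramfs system one would call ramfs_footprint_kib("/") and compare the result with the figure from "free".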
3. Initramfs->tmpfs system: The ISO is project-initramfs-tmpfs.iso. Here's a summary of points to note:
- Upon boot, "df" reports a root filesystem of 32M in size with a usage of 6.6M, while "free" reports 9.1M used. The 9M is understood as the 6M for the root filesystem plus 3M for processes. Unlike the pure initramfs system, there is a set limit on the size of the root filesystem, and "dd if=/dev/zero of=waste" does not bottom out with an "out of memory" kill but rather with "no space left on device". As expected, "df" reports no space left while "free" reports that there is still available memory in RAM, and it is still possible to spawn new processes. The limit acts as a safety feature should something run amok and fill up the root filesystem. Furthermore, it is possible to set or reset the size of the tmpfs system using "mount -o remount,size=XXX,nr_blocks=YYY,nr_inodes=ZZZ /". Interestingly enough, there are i-nodes in a tmpfs system, making it similar to working with ext2.
- Allocation of pages in the heap or stack proceeds as expected, with one subtlety. Only the "used" part of the tmpfs filesystem is counted against "free", so it is possible for page memory to encroach on tmpfs's free memory but not vice versa. To see this effect, you can try the following: upon boot, tmpfs shows that its limit is 31356k, with only 6620k used and 24736k available. This leaves approximately 32M free for page memory; however, one can run "heap 13000", forcing the allocation of 52140k of RAM, which is approximately 20M into tmpfs's allegedly available space as reported by "df". In reality, tmpfs only has about 4M free. This time "dd if=/dev/zero of=waste" does bottom out with an "out of memory" kill, but it's the "heap" process that gets the axe!!! "ps" shows that "heap" is killed and "ls -lh" reports the size of "waste" is 24.1M, as expected from the "df" report of available memory.
- In sum, the tmpfs limit is complex. It does not allow a growing root filesystem to intrude on page memory; however, it does allow page memory to encroach on root filesystem memory, but will reclaim it for the root filesystem by killing the trespassing process.
- The text test gave the same results as for the pure initramfs system.
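The gap between what "df" promises and what RAM can actually back, described above, can be estimated by comparing the two numbers directly. The following is a rough sketch in Python with made-up function names; it assumes a RAM-only system with no swap and ignores reclaimable caches. The sample figures are the ones from the text: 24736k available according to "df", and about 4M (taken here as 4096k) really free once "heap 13000" is running.

```python
import os


def df_available_kib(path="/"):
    """Space the filesystem at path claims to have available, as "df" would report it."""
    info = os.statvfs(path)
    return info.f_bavail * info.f_frsize // 1024


def mem_free_kib(meminfo_text):
    """Extract MemFree (in kB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1])
    raise ValueError("no MemFree line found")


def unbacked_kib(df_available, mem_free):
    """tmpfs space that "df" promises but free RAM cannot currently back."""
    return max(0, df_available - mem_free)


# Figures from the text; "about 4M" is taken as 4096 kB for illustration.
sample_meminfo = "MemTotal:       62000 kB\nMemFree:         4096 kB\n"  # MemTotal is hypothetical
print(unbacked_kib(24736, mem_free_kib(sample_meminfo)), "K promised but not backed")
```

On a live system one would pass df_available_kib("/") and the contents of /proc/meminfo instead of the sample values.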
4. The control system: To compare the above results, I also looked at a "disked" version. It is a VMWare image: disked.tar.bz2. Here are the results:
- The disk usage is the same as with the ramdisk. It is fixed and cannot be changed.
- Allocation of pages in the heap or stack counts against RAM usage.
- The initial spawning of "text" incurs a 4148k hit against RAM, but subsequent instances only incur about 32k.
- As with the ramdisk system, it is not possible for page memory to encroach on filesystem memory or vice versa. There was no swapping during any of these tests.
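The "text" results for the four systems can be put side by side with a simple rule of thumb: if the first instance costs roughly the size of the text more than a later instance, the text was copied into page memory. The sketch below encodes that rule in Python. The figures are those quoted in the text ("about 32k" is entered as 32), while the function name and the half-the-text-size threshold are arbitrary choices of mine.

```python
TEXT_KIB = 4000  # 1000 pages of instructions in "text"

# (RAM increase for the first instance, increases for later instances), in K.
OBSERVED_KIB = {
    "ramdisk": (4208, [28, 32, 28, 32]),
    "initramfs": (32, [32]),   # "about 32k" for every instance, even the first
    "tmpfs": (32, [32]),       # same results as the pure initramfs system
    "disked": (4148, [32]),    # 4148k first, then "about 32k"
}


def copies_text(first_kib, later_kib, text_kib=TEXT_KIB):
    """True if the first instance cost about one text size more than later ones."""
    baseline = sum(later_kib) / len(later_kib)
    # Arbitrary threshold: more than half the text size counts as a copy.
    return first_kib - baseline >= text_kib / 2


for system, (first, later) in OBSERVED_KIB.items():
    verdict = "copied into page memory" if copies_text(first, later) else "not copied"
    print("%-10s text %s" % (system, verdict))
```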