File Saver: Fast parallel file size scanner
This blog was originally published on https://beijaflor.io/blog/06-2020/filesaver-freeing-up-space/
Since I got my latest MacBook, I can only go for a couple of months before it starts running out of free disk space. It’s a pretty bad situation where to save a bit on SSD size, I have to spend my time cleaning-up large files like music samples, VMs,
node_modules and so on.
So then I decided to save a lot of time 😂 and instead of buying one of the available “disk cleaning” solutions, I spent some time trying some open-source software and eventually decided to write FileSaver, a fast multi-threaded disk usage scanner.
On my MacBook, scanning 5.3 million files takes 100 seconds, with around 51.000 files/second of throughput. The compromise is high memory usage of around 2GB.
What I used to do with coreutils
I used to do the following command combination to find sizes:
# Find file sizes
du -h ./directory | \
# Sort by size
# (just `sort` on linux, requires gnu coreutils on OSX)
gsort -h | \
# Put on a pager
This works fine except that:
- It’s relatively slow
- It’s not interactive
I’ve also tried a great alternative to the above, called
duc is a great open-source utility written in C that performs this task a bit faster and with interactive options. It can also generate an index that can be re-used, so we don’t need to scan all files when looking for places to clean.
Even so, what I found was that on my filled-up disk, it was really slow to generate the index for the whole thing, including system directories like
/usr/local. As a developer, a lot of the trash was around those.
Parallel size scanning
duc and found that it’s single threaded nature was a bit bottleneck. Writes to the index blocked scanning file sizes and the calculations to aggregate sizes also probably compromised throughput.
I ran over this issue:
Some thoughts about indexing, filtering, parallelization, etc https://github.com/zevv/duc/issues/205
The idea that I could write a multi-threaded program and get a massive performance boost stuck with me.
I considered contributing to
duc, but it had two major flaws for me:
- It’s GPL licensed — YMMV
- It’s written in C
The 2nd point is just because, knowing a bit of C++ I didn’t see the point of going to C.
I then designed and iterated on a parallel scanning architecture / aggregation algorithm that achieves pretty good results in terms of throughput.
Problem domain and aggregation overview
The problem we have to solve is that of finding directory sizes based on all their children’s sizes.
Given a certain
path to scan, we want to
lstat it. This will give us its size and file type. If this
path is a directory, we then want to scan all its children.
This path will have “completed” once all its children have completed or if
lstat says it isn’t a directory.
So we have to continuously scan children and as they come in, update the parent path’s completed children count (and do this recursively up).
To support better UX and performance, whenever a path is completed, its delta update in size is carried up each of its parents.
For example, imagine we have the following directory structure:
1 We’ll start the scan at
- Since it’s a directory we read its children
- Its size and 1 child count is stored
2 We do the same at
/content/blog and then
3 At this point our state looks like:
/content- 1 pending child path
/content/blog- 1 pending child path
/content/blog/filesaver-freeing-up-space- 2 pending child paths
4 We scan
/content/blog/filesaver-freeing-up-space/index.md- is completed + 100KB
/content/blog/filesaver-freeing-up-space- 1 pending child paths (+ 100KB)
- Other paths’ pending counts are kept, but their total sizes are updated
/content/blog- 1 pending child paths (+ 100KB)
/content- 1 pending child paths (+ 100KB)
5 We scan
/content/blog/filesaver-freeing-up-space/logo.png- is completed + 300KB
/content/blog/filesaver-freeing-up-space- is completed (+ 300KB)
/content/blog- is completed (+ 300KB)
/content- is completed (+ 300KB)
FileSaver is a simple program, but it’s structured. The architecture is as follows.
- Lock based queue
- Worker Threads
- Reader Thread
- UI Thread
We pass data between threads using an uninteresting lock based queue, plus one extra lock on the size/pending state, since the UI thread will read it while its being processed.
For this we have
The workers work with two queues:
- workQueue provides incoming paths to scan
- resultQueue has entries pushed onto it, with their sizes, types and child paths attached
The worker will enqueue any children it finds.
The reader thread performs the aggregation as described in the aggregation overview. The state gets stored in a giant hash-map (
std::unordered_map) mapping paths to their sizes and number of pending children.
The UI uses AppKit and is written in Objective-C++. It polls the state for the directories that are visible on the screen.
For now, FileSaver is available from its GitHub page. The project has structure that should make it convenient to include it as a library onto a different codebase. Though there could still be improvements for this use-case like organizing public headers and the build system, it goes most of the way and should have good DoxyGen docs.
The code is distributed under the MIT license.