Monday, 14 May 2012

A Word or Three on Chunk Allocation

Let's say you're hanging your chunks off a huge SAN, and the local sysadmins insist on allocating filesystems rather than raw space. That's probably no big deal these days when you have a big SAN to hang off...

Before I get into the real reason for this posting, a small aside. Keep away from journalled file systems. Ext4 currently has a serious flaw with its ordering, and you'll risk damaged chunks if you have an outage. Ext3 is reliable as such, but why bother interfering with performance by inserting a journalling layer? The engine does its own journalling - otherwise known as the physical and logical logs - and of course it's finely tuned to the needs of the engine. It also gives a guarantee of data survival if the underlying disk does fail.

Use ext2 - no journalling. Nice. When the engine uses direct IO (where the kernel does not return from the write until the data has hit the disk), or carefully uses Asynchronous IO and its cleverness, then you've got all the reliability you need. Physical and logical logging is the best and only journalling needed for your engines.

With that out of the way, allocating hundreds of gigs of space even on a fast SAN can take some time. When you attach a chunk with onspaces, the engine ensures the contents of the chunk are fully fleshed out so that all space is allocated. On raw space, it only needs to check the limits. For cooked files, it needs to make sure the file is properly allocated.

UNIX supports a concept called a sparse file. In fact, sparse files have been around since day 1 of UNIX. The idea is, if you create a brand-new file, seek out to the 1 MB mark and write a byte, then only the page necessary to store that byte is allocated. The remaining pages at the beginning of the file are simply not allocated. Therefore the engine must cause all pages to be allocated by writing into them. If it didn't, you could easily have the situation where it THINKS the chunk is 200 GB, but there's no actual space available on the file system to ever flesh that out when data starts to pour into your sockets.
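To see the sparse-file effect for yourself, here is a minimal Python sketch. The function name make_sparse_file is made up for illustration, and the 512-byte unit for st_blocks is how Linux reports it, which is an assumption on other systems. It writes one byte at the 1 MB mark and compares the size the file claims to have with the space actually backing it.

```python
import os
import tempfile


def make_sparse_file(path, hole_bytes=1024 * 1024):
    """Create a new file, skip over a hole, and write a single byte."""
    with open(path, "wb") as f:
        f.seek(hole_bytes)  # nothing is written for the skipped range
        f.write(b"\0")      # only the block holding this byte needs storage
    st = os.stat(path)
    apparent = st.st_size           # the length reported by ls -l
    allocated = st.st_blocks * 512  # space in use; assumes 512-byte units (Linux)
    return apparent, allocated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        apparent, allocated = make_sparse_file(os.path.join(d, "sparse.dat"))
        print("apparent size :", apparent)
        print("allocated size:", allocated)
```

On a file system that supports sparse files the allocated figure should come out far smaller than the apparent one, although the exact number depends on the file system's block size.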

The other problem with extending the chunk using the engine is that the engine must already be built, and it will only extend one chunk at a time. Staking your claim to the space up front, or maximizing I/O parallelism, could save a lot of time.

Being in a desperate hurry to get large allocations attached to an engine, I went looking for a way to do it ahead of time before the engine is even ready. I found some utilities on Linux that pre-extend files, but they didn't happen to be available on the machine I was using. What I did find was a system call with the name posix_fallocate(), which quickly morphed into the utility described below.

Without copying in the man page, basically what posix_fallocate() does is pre-allocate all the pages of a file as fast as possible, and quick tests showed that to be true. The function takes two arguments apart from an open file descriptor: the offset within the file to start at, and the number of bytes from that offset that you want guaranteed to be actually allocated. Starting with an offset would be useful if you wanted a special sparse file fleshed out to a certain shape (well, that MIGHT be useful to someone). More usefully, if you are sure that the current contents of the file are not sparse, you can extend it from the end without forcing the kernel to uselessly run through the existing space.
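As a rough illustration of the call, here is a Python sketch built on os.posix_fallocate(), which wraps the same system call. It is not the falloc source. The function name preallocate and its extend flag are invented for this example, and it assumes the size argument means the target total length of the file.

```python
import os
import tempfile


def preallocate(path, size, extend=False):
    """Make sure the first `size` bytes of `path` are backed by real blocks.

    Assumption for this sketch: `size` is the target total length in bytes.
    With extend=True, allocation starts at the current end of the file, so
    the kernel does not have to walk the existing (non-sparse) contents.
    """
    # 0o660 is only applied when the file is created, and is subject to umask.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o660)
    try:
        offset = os.fstat(fd).st_size if extend else 0
        length = size - offset
        if length > 0:  # posix_fallocate rejects a zero or negative length
            os.posix_fallocate(fd, offset, length)
    finally:
        os.close(fd)
    return os.stat(path).st_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        chunk = os.path.join(d, "somechunk.001")
        print(preallocate(chunk, 1024 * 1024))
        print(preallocate(chunk, 2 * 1024 * 1024, extend=True))
```

os.posix_fallocate() is only available on Unix builds of Python 3.3 and later. The call never shrinks a file, so asking for less than the current length leaves the length unchanged.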

The end result is a utility I simply call falloc which I've loaded up into GitHub, address


Usage is simple:


usage: falloc [-e] filesize filename
where  -e  extend the file from the current end.
           Without -e, the system call makes sure the entire
           file is fleshed out, so it would take longer when
           used on an existing file. Always use -e if you are
           confident the file is not sparse.
filesize can have multiplier suffixes inspired from dd:
    c=1, w=2, b=512, K=1024, M=1024*1024, G=1024*1024*1024,
    kB=1000, MB=1000*1000, GB=1000*1000*1000
    p=IDS page (2K), p##=page of size ##K eg p4 => 4K

One thing the usage doesn't say: if it has to create a new file before extending it, the mode of the file will be 660 as per the usual IDS needs.
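To make the multiplier suffixes in the usage text concrete, here is a hedged Python sketch of how such a size string could be turned into bytes. The function name parse_size is hypothetical, and so is the choice to accept a bare number as a byte count. It mirrors the table in the usage text rather than the actual falloc code.

```python
import re

# Multipliers copied from the usage text; the empty suffix (plain bytes)
# is an assumption of this sketch.
_MULTIPLIERS = {
    "": 1, "c": 1, "w": 2, "b": 512,
    "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3,
    "kB": 1000, "MB": 1000 ** 2, "GB": 1000 ** 3,
}


def parse_size(text):
    """Convert strings such as '2000000p', '10p4' or '3kB' to a byte count."""
    m = re.fullmatch(r"(\d+)([A-Za-z]*)(\d*)", text)
    if not m:
        raise ValueError("bad size: %r" % text)
    count, suffix, page_kb = m.groups()
    if suffix == "p":
        # p = 2K page, p## = page of ## kilobytes
        return int(count) * int(page_kb or 2) * 1024
    if page_kb or suffix not in _MULTIPLIERS:
        raise ValueError("bad size suffix: %r" % text)
    return int(count) * _MULTIPLIERS[suffix]


if __name__ == "__main__":
    for s in ("2000000p", "10p4", "1G", "3kB"):
        print(s, "=", parse_size(s), "bytes")
```

By this arithmetic, 2000000p works out to 2000000 x 2048 = 4096000000 bytes.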

I'm quite happy with the error checking and correctness, and have used it quite freely in my work. I typically put it into a shell script containing this:

set -ex
falloc -e 2000000p /somewhere/somechunk.001
falloc -e 5000000p /somewhere/somechunk.002
.....
etc. The -x makes the shell show the commands as they run - always nice to see - and the -e makes the script stop if it craps out, i.e. it can't make one of the files, or you run out of space.

It can't hurt to run falloc on a pre-existing file. If you accidentally under-extend, the engine will take care of the allocation itself. It will just happen to be slower that way. The worst you can do is over-extend a file and be left wondering how to claim back that extra 200 GB you just tacked on by mistake. If that happens to you, take a look at the other utility included with falloc on GitHub. It's called truncate. I won't say anything more than this: if you use it incorrectly, mass destruction will knock on your door. DO THE MATHS VERY CAREFULLY. Pay attention to any offsets listed by  onstat -d  in the chunk allocation. Have a 2nd job lined up in case you screw up.
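The maths can be sketched as a sanity check to run before any truncation. This Python example is an illustration only, not part of the truncate utility, and the function names are made up. It assumes you have already converted each chunk's offset and size to kilobytes, the unit that the onspaces -o and -s options take. Figures read from onstat -d are in pages and need converting with the right page size first.

```python
def min_file_bytes(chunks_kb):
    """Smallest safe file length, given (offset_kb, size_kb) for every
    chunk that lives in the file."""
    return max((offset_kb + size_kb) * 1024 for offset_kb, size_kb in chunks_kb)


def check_truncate(new_length, chunks_kb):
    """Return new_length if it is safe, otherwise raise ValueError."""
    needed = min_file_bytes(chunks_kb)
    if new_length < needed:
        raise ValueError(
            "refusing: %d bytes is below the %d bytes the chunks need"
            % (new_length, needed)
        )
    return new_length


if __name__ == "__main__":
    # Hypothetical chunk: offset 0, 2000000 pages of 2K = 4000000 KB.
    chunks = [(0, 4000000)]
    print(min_file_bytes(chunks))
    print(check_truncate(4096000000, chunks))
```

Any new length below the highest offset-plus-size figure cuts into a chunk, so a check like this should fail loudly rather than truncate.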

Cheers

PS - this utility is built for Linux, making use of POSIX calls. I don't know if you can compile it on other operating systems such as AIX or Solaris - I haven't had the chance to try. As for Windows, I can't say either. If the calls are there it should be straightforward. Just compare the man pages between Linux and the O/S of your choice.
