Funneled Performance

I mentioned some of the growing pains the other 'nixes are going through in terms of SMP support, and it's worth noting that Linux, the BSDs and Windows aren't alone in this. OS X is still just hitting its adolescence in terms of SMP and will hopefully be taking its own little step further into adulthood with 10.4. This might seem a little weird, as everyone has heard that OS X on a dually kicks much more ass than on a single CPU.

Yes, OS X has SMP, a sort of weird variety of it. It also has nice preemptive multitasking, so your system doesn't grind to a halt if you're doing multiple things... theoretically. If you'll remember, OS X isn't FreeBSD 4.4 or FreeBSD 5.

It's a microkernel called Mach, even though it's only used as a pseudo-microkernel, with a FreeBSD 4.4x userland bolted on top of it, with bits of FreeBSD 5 thrown in. It's the bolted-on part that gets a little weird, due to something called 'funnels'.

When you stop and think about it, just bolting a BSD userland onto a separate kernel is a little weird. Normally BSD would have its own kernel, but in this case it's Mach... and they have to communicate somehow. They do it through something called a 'funnel', which is an abstraction to serialize and sync things between the two. Apple had to rip out and then rework how BSD would normally sync itself to the kernel.

The funnel situation is one of the reasons why I've always sort of laughed at the quad or eight-CPU rumors that circulate every once in a while. Yes, people from Apple have said things like "We've tested Mach on up to 20 CPUs and it scaled beautifully", and they weren't lying. But there is no way in hell the current OS would ever scale up like that. Not because Mach isn't capable of it, but because of the way BSD and Mach are currently integrated together.
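To put rough numbers on that scaling worry, here's a minimal sketch using Amdahl's law, which caps the speedup you can get when part of the work has to run one thread at a time. The 30% serialized figure is purely my assumption for illustration, not a measurement of OS X, and `amdahl_speedup` is just a name I made up.

```python
def amdahl_speedup(serial_fraction: float, cpus: int) -> float:
    """Best-case speedup when `serial_fraction` of the work cannot run in parallel."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cpus)


# Hypothetical: assume 30% of a workload's time is spent serialized behind a funnel.
SERIAL = 0.30

for cpus in (1, 2, 4, 8, 20):
    print(f"{cpus:2d} CPUs -> at most {amdahl_speedup(SERIAL, cpus):.2f}x")
```

Under that assumption the speedup can never exceed 1/0.30, or roughly 3.3x, no matter how many CPUs you throw at it. The real serialized fraction depends entirely on the workload.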

If you're inclined, there's a decent write-up by an Apple engineer on how and why much of this came to be over here:

4. Funnels: Serializing access to BSD

Funnels are quite possibly one of the most confusing elements of xnu for people familiar with other BSD kernels. They are not a lock in the traditional sense of the word (though they are sometimes referred to as ``flock'' within the kernel). Funnels are used to serialize access to the BSD segment of the kernel. This is necessary because that portion on the codebase does not have fine-grained locking, and is not fully reentrant. There are currently two funnels within the kernel, the kernel funnel (it might be more appropriate to call it the filesystem funnel, though it does protect a few calls besides the file systems), and the network funnel.

Think of it this way. When a thread is given some CPU time, it establishes a lock to the kernel with the funnel abstraction. It owns the kernel in a way, for the time it is doing its business, because it owns the funnel to the BSD-ish part of the kernel. Anything making a system call, directly or indirectly, takes the funnel and nothing else can really do its business because something else has a lock on the doorway to the kernel.

The funnel is mapped to various 'mutexes' living in the Mach kernel, or rather is a mutex. Since mutexes come up all the time when threading is discussed but rarely get explained: mutex is short for 'mutual exclusion object'. It's an abstract object that lives in the kernel and lets multiple threads share the same resource... like file access APIs. However, they can't use it simultaneously. It's all serialized. One thread can have file access, then another, but not at the same time.
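If you've never seen a mutex in action, here's a minimal sketch in Python. This is user-space code, not anything from the kernel, and `file_api_mutex`, `shared` and `write_bytes` are names I made up for the example. Four threads share one pretend resource, and the mutex makes sure only one of them touches it at a time.

```python
import threading

file_api_mutex = threading.Lock()   # stand-in for a mutex guarding one shared resource
shared = {"bytes_written": 0}       # the shared resource: a pretend file size


def write_bytes(times: int) -> None:
    for _ in range(times):
        with file_api_mutex:                 # one thread at a time gets past this line
            current = shared["bytes_written"]   # read...
            shared["bytes_written"] = current + 1  # ...modify and write back


threads = [threading.Thread(target=write_bytes, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["bytes_written"])
```

Because every read-modify-write happens while holding the mutex, no update gets lost: the threads take turns rather than stepping on each other.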

The funnel is what keeps more than one thread from the BSD-side of things from running within the kernel at the same time. Remember, the BSD side of things couldn't really be trusted, so only one thread could be run at a time so everything wouldn't go fubar. The BSDs had threading, but everything changes when you start adding in more CPUs, either virtually with HyperThreading or physically with multiple CPUs.
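Here's a toy model of that idea, assuming nothing about how xnu actually implements it: one big lock in front of a whole subsystem, with a counter to check how many threads are ever inside at once. `Funnel`, `kernel_funnel` and `fake_syscall` are hypothetical names for illustration only.

```python
import threading


class Funnel:
    """Toy funnel: a single lock guarding an entire subsystem (not real xnu code)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._lock = threading.Lock()
        self.inside = 0        # threads currently inside the protected code
        self.max_inside = 0    # high-water mark, should never exceed 1

    def __enter__(self) -> "Funnel":
        self._lock.acquire()   # everyone else queues up here
        self.inside += 1
        self.max_inside = max(self.max_inside, self.inside)
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.inside -= 1
        self._lock.release()
        return False           # never swallow exceptions


kernel_funnel = Funnel("kernel")


def fake_syscall(work: int) -> None:
    with kernel_funnel:        # take the funnel before touching the "BSD side"
        sum(range(work))       # pretend to do some BSD-side work


threads = [threading.Thread(target=fake_syscall, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(kernel_funnel.max_inside)
```

Eight threads make "system calls", but the high-water mark stays at one. That's the whole point of a funnel, and also the whole problem with it.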

As mentioned in a prior post, the appearance of things running at the same time on a single-CPU box is an illusion, but with a multi-CPU box it becomes a reality and really bad things can happen with threads if that isn't taken into consideration. As a quick detour, there are a few different types of threads in OS X, which exist in a hierarchy and are often layered on top of one another. Here they are, in order:

  • Mach Threads, which are the lowest level. User-space apps don't create these things directly.
  • POSIX threads (pthreads) which get layered on top of Mach threads. Search the site on Posix for more.
  • NSThreads & TS Threads.
    This one is a little weird, but basically NSThreads are Cocoa threads. TS Threads is an abstraction called the Thread Manager that allows Carbon apps to do things they may need to coming from OS 9, or at least in a different way than you'd do normally. I.e., remember a lot of apps in OS 9 were based on cooperative multitasking, not preemptive, so things get weird.

    In any event, these get layered over pthreads, which get layered over Mach threads. Some of the Carbon business gets weird, as if the thread is created internally by Carbon it just wraps a pthread, but if the thread is created by Carbon APIs at the request of a Carbon app, it has to go through the Thread Manager or with an MP layer task... it just gets very 'expensive' in terms of CPU time. This can make sense when you realize you're emulating a cooperative-based system on a preemptive-based system.

Now all of these various threads need to get run, and the Mach scheduler doesn't really care what type of thread it is, although some of them can get very expensive. They are all equal in its eyes, with the exception of high-priority threads, which can preempt everything as long as they don't need to lock themselves to a funnel. There's one big thing here: the scheduler will not schedule a thread for time if it is blocking in some way while waiting for something. Like, say, on I/O.
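As a rough sketch of that rule, and nothing like the real Mach scheduler, here's a toy picker that ignores what kind of thread it's looking at, skips anything that's blocked, and otherwise hands the CPU to the highest priority. `ToyThread`, `pick_next` and the thread names and priorities are all made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyThread:
    name: str
    priority: int          # higher number wins
    blocked: bool = False  # True while waiting on something, e.g. I/O


def pick_next(threads: List[ToyThread]) -> Optional[ToyThread]:
    """Return the highest-priority thread that isn't blocked, or None if all are."""
    runnable = [t for t in threads if not t.blocked]
    return max(runnable, key=lambda t: t.priority, default=None)


run_queue = [
    ToyThread("cocoa-ui", priority=50),
    ToyThread("carbon-worker", priority=30),
    ToyThread("disk-reader", priority=80, blocked=True),  # waiting on I/O
]

first_pick = pick_next(run_queue)
print(first_pick.name)

run_queue[2].blocked = False  # the I/O finishes
second_pick = pick_next(run_queue)
print(second_pick.name)
```

The high-priority thread only gets picked once it stops blocking. Until then it might as well not exist as far as the scheduler is concerned.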

To give an example, the part of the kernel that handles file system access might be started, and a unique mutex created for it. Going through a funnel, a thread can connect to it, creating a lock, but while it's doing that nothing else can have file system access, or I/O. And the scheduler is not going to run anything that is blocking on I/O, which means the system just gets choked up.

When you stop and think about it, you can start to get an idea of where you can start to get lots of problems once you have a lot of things going on, especially when you're dealing with more than one CPU. However, things aren't quite so dire. There are a few things keeping it from just being an egregious killer:

  • There isn't just one funnel anymore, now there are two. One for normal userland things, and one for network things. Originally there was only one, but now that there are two there's a big performance benefit if your app/task uses both the disk and the network.
  • IOKit, Apple's sub-system for drivers, communicates with Mach via its own scheme, which is considered to be much finer-grained. As I mentioned, it's a pseudo-microkernel now, which means these are basically Mach threads, owned by the kernel.
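To see why the second funnel helps, here's a back-of-the-envelope model. It assumes an idealized box with at least two CPUs, where jobs behind the same funnel run strictly one after another and jobs behind different funnels overlap perfectly. The job list and millisecond figures are invented, and `elapsed` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

Job = Tuple[str, int]  # (kind of work, milliseconds spent inside the funnel)


def elapsed(jobs: List[Job], funnel_of: Callable[[str], str]) -> int:
    """Total time when each funnel serializes its own jobs and funnels run in parallel."""
    per_funnel = defaultdict(int)
    for kind, millis in jobs:
        per_funnel[funnel_of(kind)] += millis
    return max(per_funnel.values(), default=0)


jobs = [("disk", 40), ("net", 30), ("disk", 20), ("net", 50)]

# Original design: everything squeezes through a single funnel.
one_funnel = elapsed(jobs, lambda kind: "kernel")

# Current design: network work gets its own funnel.
two_funnels = elapsed(jobs, lambda kind: "network" if kind == "net" else "kernel")

print(one_funnel, two_funnels)  # prints "140 80"
```

In this made-up mix the split takes things from 140 ms down to 80 ms. An app that only hits the disk gets nothing out of the second funnel, which is why this is a stop-gap rather than real fine-grained locking.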

Still, the situation is far from ideal, and causes a lot of performance killers, and you can probably imagine what happens when both of the funnels get locked onto at the same time. It's not just a problem on boxes with multiple CPUs; having more of them just exacerbates it. This is one of the reasons why you can see the performance of OS X just dive, depending on what you're doing. You might only be using a bit of the CPU, but because of the funnel problem, everything goes herky-jerky and non-responsive. What you're doing on the system shouldn't cause that, and it wouldn't on, say, Linux, but it's happening anyway. Funnels.

Now, while it kinda sucks, the two-funnel thing is basically a stop-gap solution. FreeBSD at the time just had serious problems when it came to SMP, and as I mentioned is still just really getting the ball rolling. FreeBSD v5 is considered to have extremely capable SMP support now, but not then. Apple had to do something in order to have SMP, but the BSD subsystem it was attaching to was not only not efficient when it came to SMP, much of the code wasn't written in a way that would be safe. So we got funnels, but it really was an interim solution while everyone got their ducks in a row.

Let me reiterate -- this is a case where the solution is actually fairly elegant given the terms of the problem. It's just far from ideal.

Now in 10.4, Apple has stated that there'll be finer-grained SMP support, but hasn't really gone into great detail on what that means and how it'll be provided. Will there be more funnels? Will the funnel situation be going away via a different type of glue to a more modern BSD subsystem?

No real clue, as if Apple has said it anywhere publicly I haven't seen it. If it's the real deal, and not just a minor bump, there could be interesting implications:

  • Smoother, more predictable performance
    Those performance killers I mentioned are real, and all you have to do to see them yourself is really smack the file system hard from one app and then try to do things in another. Blocking on I/O is bad, mkay? An iBook makes an ideal test case. :)
  • The big boxes
    As mentioned, I've always sort of laughed at the 'big box' rumors that circulate from time to time. OS X, as it currently stands, would start to run into some real problems with a quad-CPU box in terms of efficiency, let alone something like an 8-CPU behemoth. Those are such specialized boxes in any event, but the door would certainly open wider for Apple to push them. Remember, Linux really had to grow into this too, with heavy help from IBM and others.

    Because of how expensive these boxes are, and how targeted they are, and how minor Apple's presence in this world is... it's still hard for me to picture them doing it, but it starts to become reasonable that it could be thrown on IBM hardware.
  • Multi-Core
    Much of the progress in CPU development has been towards multi-core CPUs, which do have a lot going for them. However, when you take a dual-CPU system with dual-core chips you're starting to talk 4 cores needing to be fed and tended by the OS. Your OS really needs to be efficient with them. We haven't seen much word of this on Apple's side. IBM is pushing them for the Power5 and such, but those boxes generally run Linux or IBM's own AIX Unix. Motorola has dual-core G4s in the works, but those are generally going towards the embedded sector.
  • HyperThreading
    IBM has started to push SMT, or what Intel calls HyperThreading, into its chips. Eventually we'll see it on Mac, hopefully, as while some have problems with it I've seen it working and it just rocks for most things, but I've commented on that before. As it stands, OS X would start to have some real problems dealing with this functionality efficiently. The P4 might get more of a benefit from this than say, a G4, but on a G5 it could start to make a huge difference.

Some of this can seem confusing when you hear things like XServes being used in big projects that have a thousand boxes, like say, Big Mac, but it's important to remember the distinction between hooking up a bunch of boxes to work in parallel and using one big box with multiple CPUs. At least as far as the OS is concerned. While they both benefit from the same types of workloads (tasks that can be broken up and run in parallel), a distributed workload over many boxes is a whole different type of deal than what we're talking about.

Posted by drunkenbatman
    January 23, 2005, at 08:17 PM

