Funneled Performance
I mentioned some of the growing pains the other 'nixes are going through in terms of SMP support, and it's worth noting that Linux, the BSDs and Windows aren't alone in this. OS X is still just hitting its adolescence in terms of SMP and will hopefully be taking its own little step further into adulthood with 10.4. This might seem a little weird, as everyone has heard that OS X on a dually kicks much more ass over a single CPU.
Yes, OS X has SMP, a sort of weird variety of it. It also has nice preemptive multitasking, so your system doesn't grind to a halt if you're doing multiple things... theoretically. If you'll remember, OS X isn't FreeBSD 4.4 or FreeBSD 5.
It's a microkernel called Mach, even though it's only used as a pseudo-microkernel, with a FreeBSD 4.4x userland bolted on top of it, with bits of FreeBSD 5 thrown in. It's the bolted-on part that gets a little weird, due to something called 'funnels'.
When you stop and think about it, just bolting on a BSD userland onto a separate kernel is a little weird. Normally BSD would have its own kernel, but in this case it's Mach... and they have to communicate somehow. They do it through something called a 'Funnel', which is an abstraction to serialize and sync things between the two. Apple had to rip out and then rework how BSD would normally sync itself to the kernel.
The funnel situation is one of the reasons why I've always sort of laughed at the quad or eight-CPU rumors that circulate every once in a while. Yes, people from Apple have said things like "We've tested Mach on up to 20 CPUs and it scaled beautifully", and they weren't lying. But there is no way in hell the current OS would ever scale up like that. Not because Mach isn't capable of it, but because of the way BSD and Mach are currently integrated together.
If you're inclined, there's a decent write-up by an Apple engineer on how and why much of this came to be over here:
4. Funnels: Serializing access to BSD
Funnels are quite possibly one of the most confusing elements of xnu for people familiar with other BSD kernels. They are not a lock in the traditional sense of the word (though they are sometimes referred to as ``flock'' within the kernel). Funnels are used to serialize access to the BSD segment of the kernel. This is necessary because that portion of the codebase does not have fine-grained locking, and is not fully reentrant. There are currently two funnels within the kernel, the kernel funnel (it might be more appropriate to call it the filesystem funnel, though it does protect a few calls besides the file systems), and the network funnel.
Think of it this way. When a thread is given some CPU time, it establishes a lock to the kernel with the funnel abstraction. It owns the kernel in a way, for the time it is doing its business, because it owns the funnel to the BSD-ish part of the kernel. Anything making a system call, directly or indirectly, takes the funnel and nothing else can really do its business because something else has a lock on the doorway to the kernel.
The funnel is mapped to various 'mutexes' living in the Mach kernel, or rather is a mutex. Since mutexes get talked about whenever threading comes up but rarely get explained: 'mutex' is short for 'mutual exclusion object'. It's an imaginary object that lives in the kernel and lets multiple threads share the same resource... like file access APIs. However, they can't use it simultaneously. It's all serialized. One thread can have file access, then another, but not at the same time.
The funnel is what keeps more than one thread from the BSD-side of things from running within the kernel at the same time. Remember, the BSD side of things couldn't really be trusted, so only one thread could be run at a time so everything wouldn't go fubar. The BSDs had threading, but everything changes when you start adding in more CPUs, either virtually with HyperThreading or physically with multiple CPUs.
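If it helps to see the serializing in action, here's a rough sketch in Python. It's a toy, not kernel code: the `funnel` lock, `bsd_syscall` and `run_threads` are names I made up for illustration, and an ordinary `threading.Lock` is only standing in for the real thing. Several threads all try to get into the "BSD side" at once, and we record the most that were ever in there together.

```python
import threading


def run_threads(n_threads=8, work_units=10_000):
    """Push several threads through one funnel; return peak occupancy."""
    funnel = threading.Lock()            # hypothetical stand-in for a funnel
    state = {"inside": 0, "peak": 0}     # only touched while holding the funnel

    def bsd_syscall():
        # Pretend system call: take the funnel, do the work, give it back.
        with funnel:
            state["inside"] += 1
            state["peak"] = max(state["peak"], state["inside"])
            sum(range(work_units))       # stand-in for real kernel work
            state["inside"] -= 1

    threads = [threading.Thread(target=bsd_syscall) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


if __name__ == "__main__":
    print("most threads inside the funnel at once:", run_threads())
```

However many threads you throw at it, the peak never gets past one, which is the whole point of the funnel and also the whole problem with it.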
As mentioned in a prior post, the appearance of things running at the same time on a single-CPU box is an illusion, but with a multi-CPU box it becomes a reality and really bad things can happen with threads if that isn't taken into consideration. As a quick detour, there are a few different types of threads in OS X, which exist in a hierarchy and are often layered over top of one another. Here they are, in order:
- Mach Threads, which are the lowest level. User-space apps don't create these things directly.
- POSIX threads (pthreads), which get layered on top of Mach threads. Search the site on POSIX for more.
- NSThreads & TS Threads.
This one is a little weird, but basically NSThreads are Cocoa threads. TS Threads is an abstraction called the Thread Manager that allows Carbon apps to do things they may need to coming from OS 9, or at least in a different way than you'd do normally. I.e., remember a lot of apps in OS 9 were based on cooperative multitasking, not preemptive, so things get weird.
In any event, these get layered over pthreads, which get layered over Mach threads. Some of the Carbon business gets weird, as if the thread is created internally by Carbon it just wraps a pthread, but if the thread is created by Carbon APIs at the request of a Carbon app, it has to go through the Thread Manager or with an MP layer task... it just gets very 'expensive' in terms of CPU time. This can make sense when you realize you're emulating a cooperative-based system on a preemptive-based system.
Now all of these various threads need to get run, and the Mach scheduler doesn't really care what type of thread it is, although some of them can get very expensive. They are all equal in its eyes, with the exception of high-priority threads, which can preempt everything as long as they don't need to lock themselves to a funnel. There's one big thing here: The scheduler will not schedule a thread for time if it is blocking in some way while waiting for something. Like, say, on I/O.
To give an example, the part of the kernel that handles file system access might be started, and a unique mutex created for it. Going through a funnel, a thread can connect to it, creating a lock, but while it's doing so nothing else can have file system access, or I/O. And the scheduler is not going to run anything that is blocking on I/O, which means the system just gets choked up.
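Here's that pile-up as a small Python sketch. Same caveats as any toy: `funnel` is just an ordinary lock I've made up to stand in for the kernel funnel, `time.sleep` plays the part of a slow disk, and the model is the simplified one described above, where whoever holds the lock keeps it for the whole wait. One thread grabs the funnel and "blocks on I/O", and we time how long a second, trivial call has to sit around waiting.

```python
import threading
import time

funnel = threading.Lock()  # hypothetical stand-in for the kernel funnel


def slow_disk_read(has_funnel, io_seconds):
    """Take the funnel, then sit on it while the 'disk' is busy."""
    with funnel:
        has_funnel.set()        # signal that we now own the funnel
        time.sleep(io_seconds)  # pretend to block on slow I/O


def quick_call():
    """A trivial call that needs the funnel; return how long it waited."""
    start = time.monotonic()
    with funnel:
        pass                    # nothing to do once we finally get in
    return time.monotonic() - start


def measure_stall(io_seconds=0.2):
    has_funnel = threading.Event()
    hog = threading.Thread(target=slow_disk_read,
                           args=(has_funnel, io_seconds))
    hog.start()
    has_funnel.wait()           # make sure the hog really holds the funnel
    waited = quick_call()
    hog.join()
    return waited


if __name__ == "__main__":
    print("quick call waited %.2f seconds" % measure_stall())
```

The quick call does no work at all, yet it ends up waiting roughly as long as the slow read takes. That's the herky-jerky feeling in miniature.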
When you stop and think about it, you can start to get an idea of where you can start to get lots of problems once you have a lot of things going on, especially when you're dealing with more than one CPU. However, things aren't quite so dire. There are a few things keeping it from just being an egregious killer:
- There isn't just one funnel anymore, now there are two. One for normal userland things, and one for network things. Originally there was only one, but because there are now two there's a big performance benefit if your app/task uses both the disk and the network.
- IOKit, Apple's sub-system for drivers, communicates with Mach via its own scheme, which is considered to be much finer-grained. As I mentioned, it's a pseudo-microkernel now, which means these are basically Mach threads, owned by the kernel.
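To get a feel for why the second funnel helps, here's another toy sketch in Python. The locks and function names are mine, purely for illustration, and the sleeps are stand-ins for real work. We run one disk-ish task and one network-ish task, first with both fighting over a single lock and then with a lock each, and compare the wall-clock time.

```python
import threading
import time


def run_disk_and_network(disk_funnel, net_funnel, seconds=0.2):
    """Run one disk-ish and one network-ish task; return wall-clock time."""
    def task(funnel):
        with funnel:
            time.sleep(seconds)  # stand-in for work done under the funnel

    threads = [threading.Thread(target=task, args=(disk_funnel,)),
               threading.Thread(target=task, args=(net_funnel,))]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start


def compare():
    single = threading.Lock()
    # Old world: disk and network share the same funnel.
    one_funnel = run_disk_and_network(single, single)
    # Two-funnel world: each side gets its own lock.
    two_funnels = run_disk_and_network(threading.Lock(), threading.Lock())
    return one_funnel, two_funnels


if __name__ == "__main__":
    one, two = compare()
    print("one funnel: %.2fs, two funnels: %.2fs" % (one, two))
```

With one lock the two tasks run back to back; with two they overlap. Of course, two disk-heavy tasks would still queue up behind the same lock, which is why this only goes so far.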
Still, the situation is far from ideal, and causes a lot of performance killers, and you can probably imagine what happens when both of the funnels get locked onto at the same time. It's not just a problem with boxes that have multiple CPUs; that just exacerbates it. This is one of the reasons why you can see the performance of OS X just dive, depending on what you're doing. You might only be using a bit of the CPU, but because of the funnel problem, everything goes herky-jerky and non-responsive. What you're doing on the system shouldn't be doing it, and it wouldn't do it on say, Linux, but it's happening anyways. Funnels.
Now, while it kinda sucks, the two-funnel thing is basically a stop-gap solution. FreeBSD at the time just had serious problems when it came to SMP, and as I mentioned is still just really getting the ball rolling. FreeBSD v5 is considered to have extremely capable SMP support now, but not then. Apple had to do something in order to have SMP, but the BSD subsystem it was attaching to was not only not efficient when it came to SMP, much of the code wasn't written in a way that would be safe. So we got funnels, but it really was an interim solution while everyone got their ducks in a row.
Let me reiterate -- this is a case where the solution is actually fairly elegant given the terms of the problem. It's just far from ideal.
Now in 10.4, Apple has stated that there'll be finer-grained SMP support, but hasn't really gone into great detail on what that means and how it'll be provided. Will there be more funnels? Will the funnel situation be going away via a different type of glue to a more modern BSD subsystem?
No real clue, as if Apple has said it anywhere publicly I haven't seen it. If it's the real deal, and not just a minor bump, there could be interesting implications:
- Smoother, more predictable performance
Those performance killers I mentioned are real, and all you have to do to see them yourself is really smack the file system hard from one app and then try to do things in another. Blocking on I/O is bad, mkay? An iBook makes an ideal test case. :)
- The big boxes
As mentioned, I've always sort of laughed at the 'big box' rumors that circulate from time to time. OS X, as it currently stands, would start to run into some real problems with a quad-CPU box in terms of efficiency, let alone something like an 8-CPU behemoth. Those are such specialized boxes in any event, but the door would certainly open wider for Apple to push them. Remember, Linux really had to grow into this too, with heavy help from IBM and others.
Because of how expensive these boxes are, and how targeted they are, and how minor Apple's presence in this world is... it's still hard for me to picture them doing it, but it starts to become reasonable that it could be thrown on IBM hardware.
- Multi-Core
Much of the progress in CPU development has been towards multi-core CPUs, which do have a lot going for them. However, when you take a dually system with dual-core chips you're starting to talk 4 cores needing to be fed and tended by the OS. Your OS really needs to be efficient with them. We haven't seen much word of this on Apple's side. IBM is pushing them for the Power5 and such, but those boxes generally run Linux or IBM's own AIX Unix. Motorola has dual-core G4s in the works, but those are generally going towards the embedded sector.
- HyperThreading
IBM has started to push SMT, or what Intel calls HyperThreading, into its chips. Eventually we'll see it on Mac, hopefully, as while some have problems with it I've seen it working and it just rocks for most things, but I've commented on that before. As it stands, OS X would start to have some real problems dealing with this functionality efficiently. The P4 might get more of a benefit from this than say, a G4, but on a G5 it could start to make a huge difference.
Some of this can seem confusing when you hear things like XServes being used in big projects that have a thousand boxes, like say, Big Mac, but it's important to remember the distinction between hooking up a bunch of boxes to work in parallel and using one big box with multiple CPUs. At least as far as the OS is concerned. While they both benefit from the same types of workloads (tasks that can be broken up and run in parallel), a distributed workload over many boxes is a whole different type of deal than what we're talking about.
Comments (18)
Posted by: J Coop at January 23, 2005 09:12 PM
No offense to you Mr. DB, but if these 'funnels' existed and were such a problem wouldn't they have come up before now? I have heard =every= complaint in OS X about speed and never once had this mentioned.
Posted by: Cap'n Hector at January 23, 2005 09:59 PM
Can you provide some docs, please?
Posted by: Twist at January 23, 2005 10:05 PM
I guess you were probably writing this up at about the same time I was writing my comments on the last article. This one is a bit over my head but I am willing to bet I have run into the problem quite often on my iBook. Especially in the not so great Epson Scanner software that I use. It works fine as long as it is the front-most app during the scanning process but try to do something else at the same time and basically it and everything else stops working. Oddly enough repeatedly clicking the title bar of the scanner progress window throughout the rest of the scan will get it to finish and free the rest of the system from its evil clutches.
Posted by: drunkenbatman at January 23, 2005 11:39 PM
I linked to one article I had in my notes from ages ago -- unfortunately most of this was a brain dump as usual. However, googling for "os x funnel kernel" shows quite a bit that would be applicable and getting you in the right direction. So I'll just let ya. :)
Posted by: Patrick Quinn-Graham at January 24, 2005 12:52 AM
I've noticed this - and in fact the time when it becomes most obvious is when you have a dying hard drive. It actually explains quite a lot.
Posted by: C. at January 24, 2005 04:21 AM
On one of the Tiger talks they talked about this. As the whole thing is under NDA I can't say anything about it other than to confirm that Apple also felt the earlier SMP was quite a hack and that they have made a huge effort to improve the situation in 10.4.
I am guessing this is going to be quite a priority for them in the future.
Posted by: Joe at January 24, 2005 06:49 AM
Wow, you do know your stuff.
*Mods you +5, informative* oh, wait :-P.
-- Joe
Posted by: Devin Coughlin at January 24, 2005 05:52 PM
The important thing about funnels is that they aren't just a lock. If a thread that has the funnel blocks for some reason (perhaps it is waiting on some other lock) it gives up the funnel while it is blocking and only gets it back when it is rescheduled. This means a thread can't lock up the whole system while it waits for a resource (unless it uses a spinlock), but it does require the programmer to be aware that state can change across a blocking call.
It's a really interesting compromise between enforcing true mutual exclusion on the subsystem covered by the funnel and having fine-grained locks for every shared resource in the subsystem.
Posted by: Othman at January 25, 2005 12:27 AM
Nice write-up! I believe I might have experienced this on my iBook way too many times for my liking... basically every time the installer optimizes the disk's performance (pre-binding), my machine becomes completely unusable for minutes at a time. The 4200RPM hard drive probably doesn't help. Am I wrong that this might have something to do with it?
Posted by: Fazal Majid at January 26, 2005 03:23 AM
Another easy way of seeing this slow-down in action is when you insert an audio CD and wait for iTunes to detect it. For a few seconds, the system slows down to molasses (except iTunes playback, oddly enough), and I have a dual 2GHz G5. I guess the funnel is held while the OS is spinning up the CD drive and accessing whatever volume metadata is needed, thus blocking any other thread needing disk I/O.
Posted by: Oliver C at January 26, 2005 04:03 AM
Damn! Good job of taking something arcane and making it understandable by we commoners. You also restrained yourself in length but I would hope you would flesh some of these topics out in the future...
Posted by: Jay Jay at January 29, 2005 07:25 AM
Much like the CD spin-up, would this also be the reason my system (original 15" LCD iMac + external FW drive) "locks up" and shows the spinning beachball when the external drive is spinning up from sleep? Hmmm, very interesting. Am I also on the right track by reading between the lines of previous comments that this situation will become less of a situation with 10.4? I certainly hope so.
JJ
Posted by: Eug at March 12, 2005 07:34 PM
Somebody at Ars pointed me in this direction...
I was wondering if you would have any comment on the new 4 "CPU" support in Apple's CHUD tools (version 4.1). I'm guessing that the improvements in Tiger would make the dual dual-core Macs more feasible, but like I said, I'm just guessing.
Eug
Posted by: David Smith at March 12, 2005 08:16 PM
Fantastic article. It explains so much about what I've seen with OSX performance, especially in low ram and slow disk situations.
Posted by: Jay Jay at April 10, 2005 07:57 AM
I went to a WWDC preview here in Sydney on the 6th and fine-grained locking was shown on a slide and talked about briefly. Maybe the era of 2+ CPUs is imminent?
JJ
Posted by: Macalicious at April 14, 2005 09:30 PM
My nipples are exploding with pleasure!
Posted by: Vincent at June 12, 2005 04:06 PM
Could you update this very interesting article with Tiger's behaviour on the matter?
Thanks