Of sharing logs and breaking expectations
Back in Of flash crowds and sharing logs, I mentioned being approached by a research group at UIUC who wanted me to send over my log files for use in their dual research projects: one on the characteristics of flash crowds, and one on their interesting idea for alleviating the effects of those crowds on a server.
I wanted to give you an update on what's going on with that (we'll get to the graphic above shortly), as well as a few other things. I go into more specific detail regarding the project in the post above, as well as the problems it's trying to fix, but the gist involves three parts:
- The ability to accurately recognize an oncoming flash crowd (say, from Slashdot) extremely quickly. It's not as easy as it sounds: you don't want a lot of false positives, so you have to work the trade-off between watching patterns by the second or by the minute... assuming you can accurately recognize the pattern at all.
- Kicking the web server into 'overhaul mode' via an Apache extension (mod_overhaul), which means it stops serving out whole files and instead breaks them up into tiny pieces and hands those to the clients. HTML, images, everything gets broken up.
- Your web browser accepts the pieces given to it by the server, and then forms a P2P network with the other people trying to hit the hammered site. The browsers still accept the pieces from the server, but also share the pieces they've received amongst themselves.
Basically, BitTorrent between web browsers, but in a really optimal way. It only kicks in when the site is getting hammered, which means there are tons of peers to trade data with, so things are fast... it's not like you'll be trying to connect to 5 other peers browsing the website.
Because the server is able to drop down into overhaul mode and pass off tiny little chunks, it's able to handle a much larger number of clients without melting, while saving bandwidth to boot. Of course the devil is in the details, which is why it's being researched.
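For the code-inclined, here's a rough sketch of what "breaking a file into tiny pieces" could look like. To be clear, this is not mod_overhaul and not the researchers' code: the piece size, the SHA-1 checksum per piece, and the function names `split_into_pieces` and `reassemble` are all assumptions made up for illustration.

```python
import hashlib

# Hypothetical piece size; the post only says the pieces are "tiny".
CHUNK_SIZE = 16 * 1024


def split_into_pieces(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a response body into fixed-size pieces, each tagged with a checksum."""
    pieces = []
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        pieces.append({
            "index": offset // chunk_size,
            # Assumed integrity check so a peer can't hand over a bad piece.
            "sha1": hashlib.sha1(piece).hexdigest(),
            "data": piece,
        })
    return pieces


def reassemble(pieces):
    """Verify every piece, then rebuild the body in index order."""
    body = bytearray()
    for piece in sorted(pieces, key=lambda p: p["index"]):
        if hashlib.sha1(piece["data"]).hexdigest() != piece["sha1"]:
            raise ValueError(f"piece {piece['index']} failed its checksum")
        body.extend(piece["data"])
    return bytes(body)
```

Because each piece carries its index and checksum, it wouldn't matter whether a browser got it from the server or from another visitor, or in what order the pieces arrive.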
If you'll recall, I mentioned that I was highly inclined to participate since I'd be able to strip out identifying info from the logs, but originally talked about it on the blog for two reasons:
- Someone could raise objections or problems I just hadn't thought of, which wouldn't be the first time.
- There's sometimes a difference between reality and expectation. I wasn't so much asking permission in the original post as just letting people know what was going on: while they're my logs to do with as I wish, people wouldn't necessarily expect that I'd just be passing them off to others.
If your significant other calls you on your cell phone at work, and it turns out you're having a pint at the pub with friends, you've got a problem. It's not that your significant other necessarily has a problem with you enjoying a cold one out in public, or that you don't have a right to be at the pub; it's that they expected you to be at work and you were actually somewhere else.
Anyone who's ever been in a relationship has, at least if they know what's good for them, come to the realization that breaking an expectation can be just as damaging as breaking a promise. More often than not, this is a lesson we learn the hard way, but it's a sound one -- the easiest way for a relationship to become fubar is for the parties in it to be unaware that they have wildly divergent expectations.
The term 'relationship' can become mired in semantics, but where there are expectations there's a relationship of sorts. You and I, as writer and reader, have a relationship of sorts with its own expectations. Companies have relationships with their customers, and software developers have a relationship with their users. If all you are going by is what you've committed to on the dotted line and promised -- and ignoring what people actually expect -- you are laying kindling all over the place. It may not actually get lit, but if it does...
...then you have something like the WordPress controversy going on over at Waxy (with a follow-up). No, I'm not ignoring it, and I've had some interesting conversations about it with those who passed it on (before I even saw it in my Waxy feed, impressive-like). These were generally rabid WP friends, and their reactions weren't very pretty.
I like WordPress, I've had some brief interaction with Matt in the past and liked what I saw, and my gut tells me this is going to boil down to the two tenets above: two parties in a relationship with divergent expectations that never got themselves in sync. They're left with the long and patient task of unwinding the clusterfuck, but from what I know of those involved in the project, I believe they'll do it.
We're going to leave it at that, and switch back over to Overhaul and the sharing of the log files.
The process of participating was simple enough; I opted to leave everything as it was in the log files except for the IP address, as things like referrers and file types really go a long way in helping them filter out content that isn't relevant. It was just a matter of getting their Perl script, giving it a look over to see that it did what they said, asking a friend who knows much more about Perl than I do to do the same, and then running it against the log files.
It's fairly rudimentary. Whenever the script encounters a new host, it writes the IP to the hosts file and replaces it in the log with a sequentially numbered pointer. For each host it encounters in the logs, it checks the hosts file to see if it's seen the same IP before, and if so strips out the IP and replaces it with its matching placeholder. I just have to keep the hosts file around so it's added to each time the script is run against a set of logs.
Basic and effective; they have the unique entry they need to compile statistics, but since I'm not sending on the hosts file you're just a number to them.
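The actual script is Perl and isn't mine to publish, but the idea is simple enough to sketch in Python. Everything below is an assumption for illustration: the `host<N>` placeholder format, the one-IP-per-line hosts file, and the function names are mine, and it only handles logs where each line starts with an IPv4 address.

```python
import re

# Assumes each log line begins with an IPv4 address, as in common Apache formats.
HOST_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\b")


def load_hosts(path):
    """Read the private hosts file: one IP per line, line order gives its number."""
    try:
        with open(path) as fh:
            ips = [line.strip() for line in fh if line.strip()]
    except FileNotFoundError:
        ips = []  # first run: nothing seen yet
    return {ip: number for number, ip in enumerate(ips, start=1)}


def anonymize(lines, hosts):
    """Swap each leading IP for its placeholder, numbering unseen IPs sequentially."""
    out = []
    for line in lines:
        match = HOST_RE.match(line)
        if match:
            ip = match.group(1)
            # Reuse the existing number if this IP was seen before.
            number = hosts.setdefault(ip, len(hosts) + 1)
            line = f"host{number}" + line[match.end():]
        out.append(line)
    return out


def save_hosts(path, hosts):
    """Write the hosts file back out in number order so later runs stay consistent."""
    with open(path, "w") as fh:
        for ip, _ in sorted(hosts.items(), key=lambda item: item[1]):
            fh.write(ip + "\n")
```

The hosts file is the only thing that maps a number back to an IP, which is why it stays behind while the rewritten logs get sent on.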
I've been dutifully processing the logs and sending them off (Jay is actually kinda creepy about looking forward to getting the data whenever the site gets hit hard; the term 'delicious dissection' was used), and the good news is that I got to see a copy of the pre-print manuscript of their paper to see what they've been up to.
I'd given it a once over, but wasn't really able to sit down and digest the math and other aspects of what they're up to until last night and this morning. It's absolutely fascinating stuff, my head hasn't hurt this bad since I had to dive into some of the deeper side of Rentz's work, and I'm feeling really good about contributing.
The bad stuff, as usual, comes in threes:
- I promised I wouldn't divulge or publish any of the content of the manuscript. Somewhere along the line I got a reputation as a blogger one can trust not to blog, and it's starting to suck. This stuff is really cool, and my geek-inclined readers would lap it up.
- They're working towards completing their research, but then there is the cycle of getting it published. Peer review, that sort of thing. Basically, we're looking at a minimum of six months until it's out.
- Apparently the site's erratic traffic pattern shows off the flash mob thing really well, which at least means the site got into the paper as it stands. This would normally be a win-win, but Jay and I are going to have to haggle a bit over naming.
Realistically, this is perhaps the one and only chance of ever getting 'drunkenbatman' or 'DrunkenBlog' in a serious academic paper that doesn't involve criminal profiling, and that'd be one hell of a cool hack.
We've been on national TV and in a national newspaper, so this is virgin soil. The research group, perhaps understandably, is not as gung-ho about the idea as I am. Jay mentioned a possible compromise, so we'll see.
However, I couldn't completely leave you hanging, which brings us to the strange graphic you saw at the top of the post. If you clicky below, you'll see what the traffic pattern of DrunkenBlog looks like in the first three minutes of a flash mob hitting it. I don't recall which one it was, but it's pretty damn cool nonetheless.
It's focused on bandwidth, rather than hits, because the goal of the research is to have a site slipping into Overhaul Mode use the same amount of bandwidth while being hit with a flash mob as the site would be using normally.
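If you want to see what "bandwidth rather than hits" means against your own logs, here's a minimal sketch that totals bytes served per second. It is not how the researchers built their graph; it assumes Apache-style lines with a bracketed timestamp followed by the request, status, and byte count, and the names `bytes_per_second` and `seconds_over_baseline` (and the idea of a fixed baseline) are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: ... [05/Apr/2005:01:57:00 -0500] "GET / HTTP/1.1" 200 1234
LOG_RE = re.compile(r'\[([^\]]+)\] "[^"]*" \d{3} (\d+|-)')


def bytes_per_second(lines):
    """Total the bytes served in each one-second bucket."""
    totals = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if not match:
            continue  # skip lines that don't fit the assumed format
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        # Apache logs "-" when no body was sent (e.g. a 304).
        size = 0 if match.group(2) == "-" else int(match.group(2))
        totals[stamp] += size
    return totals


def seconds_over_baseline(totals, baseline_bytes):
    """List the seconds where bandwidth exceeded a (hypothetical) normal level."""
    return sorted(stamp for stamp, size in totals.items() if size > baseline_bytes)
```

A thousand hits on a tiny favicon and a hundred hits on a fat image look very different once you count bytes instead of requests, which is the point of graphing it this way.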
My hope is to be able to talk him into giving me a nice fat .EPS of a minute's time with that granularity or even -- dare I dream -- five to twenty minutes worth all laid down so I can print it out and run it along the top of the four walls. I know, I'm a dork.
Anyways, ignoring my decorating dreams, these guys are putting your hits to good use, and I'll keep trying to get them to part with tidbits.
Comments (16)
Posted by: defiantai (AIM) at April 5, 2005 11:11 AM
forgot my name, sorry
Posted by: ffbt at April 5, 2005 11:14 AM
WTF Michael? You were talking to me at 3am... Get some sleep, some of your sentences are atrocious in this
Posted by: nessence at April 5, 2005 11:55 AM
I'd hate to ruin it for them. I think 'flash crowds' and dealing with them is going to be an easy problem to solve and not require the complexities of peer to peer. I think, if you can put me in touch with them, I'd be glad to tear into it. Not to say I could give them mathematical functions to go with the information but I've dealt with flash crowds myself. You see, that's the actual problem - I'm not virgin to dealing with flash crowd. A flash crowd is only a flash crowd if the person is not prepared for the crowd. mod_overhaul wouldn't help drunkenblog - or any other blog - anymore than the things you do before you know you're going to get slammed.
The next issue is that, from what I have read, their entire research is based on bandwidth. The real killer with flash crowds is the load on the server, not the bandwidth - you could have told them that on day one. Bandwidth /is/ usually only a factor when you are with a shared hosting provider - in which case you are on a server with other users. Bandwidth is over time and most shared environments will cut off your access if you load down the box before you run out of bandwidth.
Next is the environment. You can have a dedicated server, shared hosting, hosted by a friend, hosted at your home, or hosted by something like a university or a communal site such as a blog service or maybe a company directory. If this were 1999 I would understand bandwidth to be a problem, but today, the only one of those scenarios wherein bandwidth would starve a flash crowd is if you're hosted off your cable modem or dsl. This isn't really a valid case though because it is no longer a stable or legitimate place to host a site in the first place. Very few ISPs still let you run a 'dedicated' web site and effortlessly cut off your connection whether you're using mod_overhaul or not (to them, the mere number of packets induces load, not the bandwidth).
Then there is the HTTP server itself. Apache can handle 300 concurrent sessions and hoses your box in the process - lighttpd however will handle 800 sessions without breaking a sweat. WordPress can handle a tremendous load, especially with the Staticize plugin - however Movable Type will make you wish you never touched a server in your life (it will kill your server and beat it like a dead horse).
Last, there is the browser and the HTTP protocol. This is more religious than it is informal, but wrapping p2p around http and a browser is like trying to drive down the road counting every single stripe on the road without losing count.
Back to flash crowds and why something like mod_overhaul is a vaporconcept. Server hardware and bandwidth are becoming cheap. It's like the opposite of oil prices. 10 years ago it cost $20k for a dual processor box, some RAM, and a RAID array. Today, you can get a box with 100 times the power for 1/10th the price. That's only in ten years. 6 months to a year from now, a) servers, and thus hosting costs will be cheaper, and b) bandwidth will be cheaper - /especially/ in a few years. Every telco in the US is starting to deploy faster broadband and fiber. It's not just to make things faster - they are taking all of the old equipment (1.5mbps dsl, etc) and relocating it to the 'last mile'. Basically, they're throwing in fiber or vdsl in place of the usual dsl equipment and throwing the usual dsl equipment into those areas where users can barely get dialup. The real catch is that the telcos are going to start offering TV and Video on Demand service over the high speed stuff in order to bankroll the last mile. Meanwhile, we're making faster and more efficient compression mechanisms to keep all those dialup users happy. So, let's face it - bandwidth is getting cheaper and will continue to do so for a very long time.
Last, the load of p2p is cpu-bound. If a server can't handle a flash crowd (assuming bandwidth is unlimited) then it surely won't handle a tracker for the p2p system. Unless the tracker was centralized and then you're stuck with either a) a massively distributed index/tracking system (which doesn't yet exist...gnutella is close) or b) a centrally managed system where there will be an authority with power and power corrupts ('nuff said). If none of this were an issue - and mod_overhaul was complete - you'd have the issue of people configuring their systems for mod_overhaul. No doubt, such a system would either have to a) have a self-hosted tracker (which we deem void b/c if it can't handle the flash crowd it won't handle the load of a tracker) or b) a central tracker which would require a really nice public key infrastructure to be anything close to secure. We can't even get people to use PGP - and they barely are able to handle BitTorrent, let alone having a system that requires both.
If that's not enough, you'd have to get acceptance of mod_overhaul by shared hosting providers or get your own dedicated server. In the latter case your inability to handle a flash crowd would be your technical inabilities in which case if you knew how to set up mod_overhaul you'd be able to set up your system for a flash crowd. In the former case, shared hosting providers are taking more than a year to get php5 and mysql4.1.1 support let alone adding in JAAM (Just Another Apache Module).
Could such a thing theoretically solve the problem on paper? Of course. Will it happen IRL? Nope. If it does - its necessity won't be long lived.
BTW, I got
`MT::App::Comments=HASH(0x8283334) Use of uninitialized value in sprintf at lib/MT/Template/Context.pm line 1187.` when previewing my post.
Posted by: primary0 at April 5, 2005 01:22 PM
in the graphic, between 01:57:00 and 01:57:45, it looks as if batman stopped by. sorry, but i see quite a lot of batman outlines out in the world.
Posted by: Ben Donley at April 5, 2005 01:28 PM
nessence, you're working for a hosting provider. Your picture of the web may be slightly skewed. Half the time, when something gets squished by slashdot, it's because they *are* basically serving static content over a cable modem. Yes, that problem may go away completely for managed hosting, but it will be a while before most users have enough upstream bandwidth to serve content.
I don't understand why they're so concerned with only "going into overhaul mode" when there's a bandwidth problem. Seems like if their algos were going to scale, they should be able to scale up and down just as well, and that is why they are CS researchers while I code VB for a living. Er...
Posted by: FredB at April 5, 2005 02:42 PM
At the start of nessence's post, I thought, "This is Jason from TXD". Then saw the sig and thought, oh no. Then clicked the sig and textdrive.com loaded so: If you're Jason, I trust every word you wrote cause you know what you're talking about. If you're not Jason, you're his twin. ;)
Anyway, this sounds realistic to me.
Did those guys doing the research really forget to talk to people working in hosting companies that have been slashdotted numerous times? It's nice to ask bloggers for data, but it seems necessary to me to talk to the guys handling those situations in real life...
Posted by: nessence at April 5, 2005 03:15 PM
Yes, I'm biased and my view is skewed. I'm aka 'AlexL'.
I believe in what Ben says but I also know that only the small cable providers 'allow' you to serve your own site (and almost all block hosting one's email).
My primary concern with their research is that a flash crowd is almost a growing pain for a site...usually it only happens a little bit and the site grows up to handle the crowds or dies and there is no crowd. Flash crowds in the scope of social interactions, memes, or marketing is one thing; flash crowds with regard to spikes in bandwidth is trivial and relational to the author's content and/or the hosting providers' expertise. A user on a cable modem with a crowd almost always ends up with a hosting provider (or, sometimes, gets disconnected).
Posted by: stripes at April 5, 2005 03:41 PM
I don't understand why they're so concerned with only "going into overhaul mode" when there's a bandwidth problem.
A P2P system is going to have more latency than a simple "give me your content" fetch on an unloaded pipe. It'll use more bandwidth too since all the peers have their own bits of communication overhead plus the overhead of being told what bits to fetch and their checksums. So if you don't need to use it, you are better off without it.
Posted by: Melkor at April 5, 2005 04:32 PM
I think you meant to say "tenet" instead of "tenant." I apologize if that was a typo, since I'm not out to nitpick; I'm just doing what I can to spread the grammatical love, so to speak.
Posted by: phaetor at April 5, 2005 07:42 PM
Not to say I could give them mathematical functions to go with the information but I've dealt with flash crowds myself. You see, that's the actual problem - I'm not virgin to dealing with flash crowd. A flash crowd is only a flash crowd if the person is not prepared for the crowd. mod_overhaul wouldn't help drunkenblog
You do not realize that you are coming out as a rube by talking like this. DB certainly has much experience with a flash mob effect and has seen their research and shown it has promise. The original link has a PDF which shows their introductory research and mod_overhaul working to good effect. Yet you, someone with no real background except what you say (downloading on LimeWire?) say it is all bogus and want to save them all the trouble. Sigh. :)
Posted by: Bill Hamm at April 5, 2005 07:50 PM
Stop being silly, this is like when a bunch of people at MacRumors start telling Virginia Tech exactly how they should build their cluster. They are doing actual research with actual software and benchmarking it and seeing real results yet you people claim (because you work at a hosting company that has been open less than a year?) that it is all hogwash. You show fundamental misunderstandings of what is being done.
Posted by: Ankalon at April 6, 2005 07:27 AM
With the type of stories Slashdot is covering these days, you're going to get smacked on this story. I can see it. Oh, the irony, a story about experimental bandwidth tactics ultimately kills itself. ;) Some of their stories seem to have come from the IT Weekly World News, though.
Posted by: nordsieck at April 6, 2005 10:37 AM
The main problem with flash crowds with servers is cpu load. Servers like Apache that use a thread/process per connection literally DDoS themselves because of http timeout issues. This chart is a bit old, but it illustrates the point nicely.
The other thing is: why have a mod_overhaul at all? Why not just add native bittorrent support into the client? The server wouldn't have to do a thing.
Posted by: nessence at April 6, 2005 01:05 PM
*sigh*
nobody knows what I've done or the extent of UIUC's research. flaming one or the other is wasteful. just like nordsieck, the post was to point out an indifference and opinion on the subject. Albeit saying it'll be unrealistic off of paper (or research programming, same thing) is a bit more than an opinion but I can be faulted for not posting any of my own research.
DB, do you think bandwidth is the problem with the /. effect, or your server's setup in general?
I want their project to succeed with all of its efforts. But I surely don't want it caught up on something trivial.
Posted by: Peter da Silva at April 7, 2005 10:26 AM
My problems with flash crowds have been bandwidth.
If you're serving static content then unless you're running a webserver on a 486 you can serve way more content than you can afford to serve... and then you hit your cap and you go offline or you start feeling the riptide hit your wallet.
If you're not serving static content, one way or another, you're doomed if a flash crowd hits. The first line of defense is to be able to switch to static content, and now you've reduced the problem to the previous one.
Small error: The graphic shows the first three minutes of a flash mob, not seconds... if I'm interpreting the bar below the graphic correctly, that is.