Of sharing logs and breaking expectations

[Image: hosting bandwidth]

Back in Of flash crowds and sharing logs, I mentioned being approached by a research group at UIUC who wanted me to send over my log files for use in their dual research projects: one on the characteristics of flash crowds, and one on their interesting idea of how to alleviate the effects of flash crowds on a server.

I wanted to give you an update on what's going on with that (we'll get to the graphic above shortly), as well as a few other things. I go into more specific detail regarding the project in the post above, as well as the problems it's trying to fix, but the gist involves three parts:

  1. The ability to accurately recognize an oncoming flash crowd like, say, from Slashdot, extremely quickly. It's not as easy as it sounds, as it's not something you want a lot of false positives on, so you have to work the trade-off between watching patterns by the second or by the minute... assuming you can accurately recognize the pattern at all.

  2. Kicking the web server into 'overhaul mode', via an Apache extension (mod_overhaul), which means it stops serving out straight files and instead breaks them up into tiny pieces and hands those to the clients. HTML, images, everything gets broken up.

  3. Your web browser accepts the pieces given to it by the server, and then forms a P2P network with the other people trying to hit the hammered site. The browsers still accept the pieces from the server, but also share the pieces they've received amongst themselves.

Basically, BitTorrent between web browsers, but in a really optimal way. It only kicks in when the site is getting hammered, which means there are tons of peers to trade data with, so things are fast... it's not like you'll be trying to connect to 5 other peers browsing the website.

Because the server is able to drop down into overhaul mode and pass off tiny little chunks, it's able to handle a much larger number of clients without melting, while saving bandwidth to boot. Of course the devil is in the details, which is why it's being researched.
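To make the false-positive trade-off from the first part a bit more concrete, here's a minimal sketch of one way a detector could work. To be clear, this is my own illustration and not the research group's method: the class name, the per-second buckets, and every threshold in it are assumptions. It compares each second's request count against a slow-moving average of normal traffic, and only raises the alarm after several spiking seconds in a row, so a single blip doesn't trip it.

```python
class FlashCrowdDetector:
    """Hypothetical detector: flags a flash crowd when per-second request
    counts stay far above a slow-moving baseline for several seconds."""

    def __init__(self, alpha=0.1, ratio=5.0, min_count=20, confirm=3):
        self.alpha = alpha          # how quickly the baseline adapts (assumed)
        self.ratio = ratio          # spike = count > ratio * baseline (assumed)
        self.min_count = min_count  # ignore spikes on a nearly idle site (assumed)
        self.confirm = confirm      # consecutive spiking seconds required (assumed)
        self.baseline = None
        self.streak = 0

    def update(self, count):
        """Feed the request count for the latest one-second bucket.
        Returns True once a flash crowd looks confirmed."""
        if self.baseline is None:
            # The first bucket only seeds the baseline.
            self.baseline = float(count)
            return False
        spike = count >= self.min_count and count > self.ratio * self.baseline
        if spike:
            # Don't fold spiking seconds into the baseline, or the
            # baseline would chase the crowd and hide it.
            self.streak += 1
        else:
            self.streak = 0
            self.baseline += self.alpha * (count - self.baseline)
        return self.streak >= self.confirm
```

Raising `confirm` cuts down on false positives at the cost of reacting later, which is the second-versus-minute trade-off in miniature.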

If you'll recall, I mentioned that I was highly inclined to participate since I'd be able to strip out identifying info from the logs, but originally talked about it on the blog for two reasons:

  • Someone could raise objections or problems I just hadn't thought of, which wouldn't be the first time.

  • There's sometimes a difference between reality and expectation. I wasn't so much asking permission in the original post as just letting people know what was going on. While they're my logs to do with as I wish, people wouldn't necessarily have the expectation that I'd just be passing them off to others.

If your significant other calls you on your cell phone at work, and it turns out you're having a pint at the pub with friends, you've got a problem. It's not that your significant other necessarily has a problem with you enjoying a cold one out in public, or that you don't have a right to be at the pub; it's that they expected you to be at work and you were actually somewhere else.

Anyone who's ever been in a relationship has, at least if they know what's good for them, come to the realization that breaking an expectation can be just as damaging as breaking a promise. More often than not, this is something we learn the hard way, but it's a sound lesson -- the easiest way to have a relationship become fubar is for the parties in it to be unaware that they have wildly divergent expectations of the relationship.

The term 'relationship' can become mired in semantics, but where there are expectations there's a relationship of sorts. You and I, as writer and reader, have a relationship of sorts with its own expectations. Companies have relationships with their customers, and software developers have a relationship with their users. If all you are going by is what you've committed to on the dotted line and promised -- and ignoring what people actually expect -- you are laying kindling all over the place. It may not actually get lit, but if it does...

...then you have something like the WordPress controversy going on over at Waxy (with a follow-up). No, I'm not ignoring it, and I've had some interesting conversations about it with those who passed it on (before I even saw it in my Waxy feed, impressive-like). These were generally rabid WP friends, and their reactions weren't very pretty.

I like WordPress, I've had some brief interaction with Matt in the past and liked what I saw, and my gut tells me this is going to boil down to the two tenets above: two parties in a relationship with divergent expectations that never got themselves in synch. They're left with the long and patient task of unwinding the clusterfuck, but from what I know of those involved in the project, I believe they'll do it.

We're going to leave it at that, and switch back over to Overhaul and the sharing of the log files.

The process of participating was simple enough; I opted to leave everything as it was in the log files except for the IP address, as things like referrers and file types really go a long way in helping them to filter out content that isn't relevant. It was just a matter of getting their Perl script, giving it a look over to see it did what they said, asking a friend who knows much more about Perl than I to do the same, and then running it against the log files.

It's fairly rudimentary. For each host it encounters in the logs, the script checks its hosts file to see if it's seen the same IP before. If so, it strips out the IP and replaces it with its matching placeholder; if not, it writes the host to the hosts file and replaces it in the log with the next sequentially numbered pointer. I just have to keep the hosts file around so it's added to each time the script is run against a set.

Basic and effective; they have the unique entry they need to compile statistics, but since I'm not sending on the hosts file you're just a number to them.
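For the curious, the idea can be sketched in a few lines. This is not their Perl script (I'm not reproducing that here), just a rough Python illustration of the same approach. The function names and the JSON hosts file are made up for the example, and it assumes the host is the first space-separated field on each log line, as in Apache's common log format.

```python
import json
import os


def load_hosts(path):
    """Load the host -> placeholder mapping left by earlier runs, if any."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return {}


def save_hosts(hosts, path):
    """Persist the mapping so later runs reuse the same placeholders."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(hosts, fh, indent=2)


def anonymize_line(line, hosts):
    """Swap the leading host field for a stable sequential number."""
    host, sep, rest = line.partition(" ")
    if not sep:
        return line  # no host field to replace; leave the line alone
    if host not in hosts:
        hosts[host] = len(hosts) + 1  # next sequential placeholder
    return f"{hosts[host]}{sep}{rest}"


def anonymize_log(in_path, out_path, hosts_path):
    """Anonymize one log file, updating the hosts file as it goes."""
    hosts = load_hosts(hosts_path)
    with open(in_path, encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(anonymize_line(line, hosts))
    save_hosts(hosts, hosts_path)
```

The output log is what gets sent along; the hosts file is the one piece that stays home, since it's the only thing that maps the numbers back to real addresses.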

I've been dutifully processing the logs and sending them off (Jay is actually kinda creepy about looking forward to getting the data whenever the site gets hit hard; the term 'delicious dissection' was used), and the good news is that I got to see a copy of the pre-print manuscript of their paper to see what they've been up to.

I'd given it a once over, but wasn't really able to sit down and digest the math and other aspects of what they're up to until last night and this morning. It's absolutely fascinating stuff, my head hasn't hurt this bad since I had to dive into some of the deeper side of Rentz's work, and I'm feeling really good about contributing.

The bad stuff, as usual, comes in threes:

  1. I promised I wouldn't divulge or publish any of the content of the manuscript. Somewhere along the line I got a reputation as a blogger one can trust not to blog, and it's starting to suck. This stuff is really cool, and my geek-inclined readers would lap it up.

  2. They're working towards completing their research, but then there is the cycle of getting it published. Peer review, that sort of thing. Basically, we're looking at a minimum of six months until it's out.

  3. Apparently the site's erratic traffic pattern shows off the flash mob thing really well, which at least means the site got into the paper as it stands. This would normally be a win-win, but Jay and I are going to have to haggle a bit over naming.

    Realistically, this is perhaps the one and only chance of ever getting 'drunkenbatman' or 'DrunkenBlog' in a serious academic paper that doesn't involve criminal profiling, and that'd be one hell of a cool hack.

    We've been on national TV and in a national newspaper, but an academic paper is virgin soil. The research group, perhaps understandably, is not as gung-ho about the idea as I am. Jay mentioned a possible compromise, so we'll see.

However, I couldn't completely leave you hanging, which brings us to the strange graphic you saw at the top of the post. If you clicky below, you'll see what the traffic pattern of DrunkenBlog looks like in the first three minutes of a flash mob hitting it. I don't recall which one it was, but it's pretty damn cool nonetheless.

[Image: hosting bandwidth]

It's focused on bandwidth, rather than hits, because the goal of the research is to have a site slipping into Overhaul Mode use the same amount of bandwidth while being hit with a flash mob as the site would be using normally.

My hope is to be able to talk him into giving me a nice fat .EPS of a minute's time with that granularity or even -- dare I dream -- five to twenty minutes worth all laid down so I can print it out and run it along the top of the four walls. I know, I'm a dork.

Anyways, ignoring my decorating dreams, these guys are putting your hits to good use, and I'll keep trying to get them to part with tidbits.

Posted by drunkenbatman
    April 05, 2005, at 09:57 AM

