Of flash crowds and sharing logs

A few weeks ago, I received an email asking me to hand over my logs. It's not quite what you think, although my brain went there too. Rather it's a University group doing research on the characteristics of flash mobs...

I am a 2nd year Computer Science PhD student at the University of Illinois, Urbana-Champaign (UIUC). I am working under the guidance of Dr. Indranil Gupta with a specific focus in area of distributed protocols.

Our research group has already done work relating to flash crowds previously that was published in the 9th International Workshop on Web Content Caching and Distribution (WCW2004). The work was entitled "Overhaul: Extending HTTP to Combat Flash Crowds."

For an upcoming research project, I am working alongside my colleague (Charles Yang, a 1st year student) on characterizing the properties of flash crowds (i.e., slashdot effect). For this purpose, we are seeking help of web masters and authors of popular web sites (like your site, drunkenblog.com which was linked off slashdot recently [Jan 8 '05]) to provide us with web logs. We are specifically seeking:

1. Long term logs that have at least one (possibly many) flash crowd that was referred by either a single source or multiple sources (the later strongly desired, if possible)
2. We would like to analyze the normal traffic on the web site (i.e., months or weeks prior to the flash crowd), obviously the direct effect of the flash crowd, and the long-term effects (months or weeks following a flash crowd). We would be grateful if you could provide us with the most extensive logs possible.

We also understand that this request might not settle in easily on you. We are ready to provide certain guarantees. We can list the obvious ones below and are open to any further restrictions:

1. We intend to keep the web logs private (unless you explicitly mention otherwise). The only people who will have access to these logs will be the researchers involved in this project (Dr. Gupta, Charles, and I).
2. We will not reveal the name, URL, or any specific traffic and non-traffic related characteristics of your site in any direct form.
3. We will fully acknowledge your help (i.e., "plug" your web site) and generosity in any publication that might result as part of our research.
4. For the privacy conscious, we can provide a sanitizing script to preserve the privacy of your users and resources, if you so desire.
5. We can provide a uiuc.edu FTP site for you to upload your logs.
6. We are willing to bide by additional clauses.

We hope that you are willing to help us in the spirit of extending academic research. We would be exceedingly grateful if you could work with us.

Regards,
Jay Patel

Thing is, it seems pretty legit, and I'm quite inclined to help them by handing over the logs. There are a few weird things about DrunkenBlog in this respect, though:

  • I don't really know how 'popular' DrunkenBlog really is, at least in general terms. I've come to learn that a surprising number of people know about it, but they're generally pretty plugged in and not necessarily your average Joe. I.e., the vast majority of regular readers of the site are those who are into things like RSS... and I'm apparently really (in)famous at Apple, but the front page isn't really seen that much.
  • The traffic patterns for DrunkenBlog would probably drive a researcher mad, as sometimes I'll go a few weeks without posting, and sometimes there'll be ten posts in one day. Some days it's 150,000 page views and some days it's 1,000. Some days I'm posting on threading and funnels in the OS X kernel, and some days I'm posting about nude students overseas. All part of my plans to lull you into a false sense of security before the next big thing...
  • I don't really keep log files, as much fun as it can be to geek out on them... doesn't seem like a good idea in today's society to hold onto them. I generally keep two weeks' worth, and that includes the current week. I even have a script to go through and 'sanitize' IP addresses from comments posted as anonymous after a day or so, because I figure that while some post as anon for convenience and such, others have valid reasons for not wanting things to come back to them. If they aren't going after someone, there's no reason to keep it.
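
For the curious, here is a rough sketch of what sanitizing IP addresses in a log could look like. It is not my actual script and not the one the researchers offered; it assumes an Apache-style access log where the client address is the first field, and the function name and salt are made up for illustration. Each address is swapped for a salted hash, so repeat visitors can still be counted without the address itself being kept.

```python
import hashlib
import re

# Assumption: Apache common/combined log format, client IPv4 address first.
LEADING_IPV4 = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\b")


def sanitize_line(line, salt):
    """Replace the leading client IP with a salted, truncated SHA-256 token."""
    def _mask(match):
        digest = hashlib.sha256((salt + match.group(1)).encode("utf-8")).hexdigest()
        return "anon-" + digest[:12]

    # Lines that do not start with an IPv4 address are returned unchanged.
    return LEADING_IPV4.sub(_mask, line, count=1)


# Made-up log line using a documentation-only address (192.0.2.0/24).
sample = '192.0.2.10 - - [08/Jan/2005:12:00:00 -0600] "GET / HTTP/1.1" 200 5120'
print(sanitize_line(sample, salt="throwaway-secret"))
```

The salt needs to be thrown away afterwards: there are only about four billion IPv4 addresses, so anyone holding the salt could hash them all and reverse the tokens.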

The particular log file he was after was already gone, but I do have a nice big fat one from the last few days, plus the last week, that could be passed on.

I've gone through their paper, and it's pretty interesting stuff. They've worked on a technology called 'Overhaul', with Apache as their test bed (via mod_overhaul), that implements an extension to the HTTP protocol that is almost creepily like a swarm client -- aka BitTorrent -- in order to handle a flash mob of traffic.

The idea is that when a site gets rapidly bombarded by a stampede of client requests, the server slips into 'Overhaul Mode' and breaks the document/file/image into multiple small chunks and distributes those to the clients requesting the file. The clients then connect to each other (via DHTTP) to get the data they need to form a complete document, which offloads much of the file-transferring work from the server and allows it to stay responsive to the requests coming in.
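
To make the chunking idea concrete, here is a toy sketch. It is not how mod_overhaul actually works and doesn't model the real wire protocol; the function names, the round-robin hand-out, and the chunk size are all assumptions of mine. The server cuts a page into numbered chunks, hands each client one of them, and the clients pool what they hold to rebuild the page.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Cut a document into (index, payload) chunks; the last may be shorter."""
    return [(offset // chunk_size, data[offset:offset + chunk_size])
            for offset in range(0, len(data), chunk_size)]


def reassemble(chunks) -> bytes:
    """Rebuild the document from (index, payload) pairs in any order."""
    return b"".join(payload for _, payload in sorted(chunks))


def swap_among_peers(page: bytes, chunk_size: int, num_clients: int):
    """Hypothetical round-robin hand-out: client k gets chunk k mod n.

    Returns the rebuilt page if the clients collectively hold every chunk,
    or None if some chunk was never handed out.
    """
    chunks = split_into_chunks(page, chunk_size)
    if not chunks:
        return b""
    handed_out = [chunks[k % len(chunks)] for k in range(num_clients)]
    swarm = dict(handed_out)  # index -> payload, pooled across all peers
    if len(swarm) < len(chunks):
        return None
    return reassemble(swarm.items())


page = b"<html>" + b"x" * 1000 + b"</html>"
print(swap_among_peers(page, chunk_size=128, num_clients=20) == page)
```

With fewer clients than chunks, the swarm can't finish on its own and the function returns None; in that case the clients would have to go back to the server for the rest.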

Think about it this way. Let's say something on DrunkenBlog gets listed on several high-traffic sites in an overlapping manner. Each client browser connects to the server to download the HTML page, and then renders it. A normal page might have several files -- the HTML of the page, some images, etc. If you have keepalives on, the client stays connected to the server while it snatches those in quick succession -- if you have them off, the client has to reconnect for each one.

Seems smart to automatically keep keepalives on, but consider that you generally need to limit how many instances of Apache are running, so that when thousands of people are trying to connect at once, a thousand instances don't get spawned, which would kill the server. If a bunch of people with modems are tying up an Apache process... anyways, gone into all this before.

Long and short, things are either going to get really slow or you'll starve clients out -- the server just won't be able to deal with their requests before the browser decides the server has gone stupid.
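
A back-of-the-envelope version of that squeeze, as a sketch. The model is deliberately crude (every client holds one server process for a fixed time, and arrivals are spread evenly), and the process limit and hold times are numbers I made up for illustration.

```python
def starved_clients(arrivals: int, window_s: float, workers: int, hold_s: float) -> int:
    """Clients still unserved when the window ends.

    Each client ties up one worker for hold_s seconds and at most `workers`
    are busy at once, so capacity over the window is workers * window_s / hold_s.
    """
    capacity = int(workers * window_s / hold_s)
    return max(0, arrivals - capacity)


# Hypothetical: 10,000 visitors in five minutes, 150 server processes.
slow = starved_clients(10_000, 300, workers=150, hold_s=10)  # modem-ish clients
fast = starved_clients(10_000, 300, workers=150, hold_s=2)   # quick clients
print(slow, fast)  # 5500 0
```

Same server, same crowd: once each visitor hangs onto a process for ten seconds instead of two, more than half of them never get served inside the window.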

With Overhaul, when this happened, the server would kick into a different mode and break up the text of the page that has gone hyper-popular, and whatever files and images are on it, into small chunks and serve those out. So if 10,000 clients were all requesting the page, they'd each only get a small chunk and then talk amongst themselves in order to get all the chunks they needed for the browser to render the page... all depending upon how overloaded the server was.

If a server would normally start to just die when serving 10,000 clients in a five-minute period, this might allow it to stay responsive to much, much more traffic than that: while the clients can still pull chunks from the server, they're also pulling chunks from all the other clients connecting...
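
To put rough numbers on it, here is a sketch of how much the server itself has to send in each case. The page and chunk sizes are hypothetical, and it assumes the best case where every client takes exactly one chunk from the server and gets the rest from peers; protocol overhead and clients coming back for more chunks are ignored.

```python
def server_upload_bytes(clients: int, page_bytes: int, chunk_bytes=None) -> int:
    """Bytes the server sends: the whole page per client, or one chunk each."""
    per_client = page_bytes if chunk_bytes is None else min(chunk_bytes, page_bytes)
    return clients * per_client


PAGE = 100 * 1024  # hypothetical 100 KB page, images included
CHUNK = 4 * 1024   # hypothetical 4 KB chunk

plain = server_upload_bytes(10_000, PAGE)
chunked = server_upload_bytes(10_000, PAGE, CHUNK)
print(plain, chunked, plain // chunked)  # 1024000000 40960000 25
```

Under those assumptions the server ships about a twenty-fifth of the data, and the clients carry the rest among themselves.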

It's a really neat idea that I'm boiling down in a way that probably doesn't do it justice, but suffice to say I'm fascinated and really inclined to try to help them out.

However, while it's my data, my gut says I should ask my readers if they'd have any problems with it. While I may have the right to do what I want with it, it doesn't mean it's something I should do...

Assuming I got the scripts to sanitize the logs, and they seemed to be legit in filtering out identifying information, would any of you have a problem with it?

Posted by drunkenbatman
    February 27, 2005, at 12:09 AM