Of flash crowds and sharing logs

A few weeks ago, I received an email asking me to hand over my logs. It's not quite what you think, although my brain went there too. Rather it's a University group doing research on the characteristics of flash mobs...
I am a 2nd year Computer Science PhD student at the University of Illinois, Urbana-Champaign (UIUC). I am working under the guidance of Dr. Indranil Gupta with a specific focus in area of distributed protocols.
Our research group has already done work relating to flash crowds previously that was published in the 9th International Workshop on Web Content Caching and Distribution (WCW2004). The work was entitled "Overhaul: Extending HTTP to Combat Flash Crowds."
For an upcoming research project, I am working alongside my colleague (Charles Yang, a 1st year student) on characterizing the properties of flash crowds (i.e., slashdot effect). For this purpose, we are seeking help of web masters and authors of popular web sites (like your site, drunkenblog.com which was linked off slashdot recently [Jan 8 '05]) to provide us with web logs. We are specifically seeking:
1. Long term logs that have at least one (possibly many) flash crowd that was referred by either a single source or multiple sources (the latter strongly desired, if possible)
2. We would like to analyze the normal traffic on the web site (i.e., months or weeks prior to the flash crowd), obviously the direct effect of the flash crowd, and the long-term effects (months or weeks following a flash crowd). We would be grateful if you could provide us with the most extensive logs possible.
We also understand that this request might not settle in easily on you. We are ready to provide certain guarantees. We can list the obvious ones below and are open to any further restrictions:
1. We intend to keep the web logs private (unless you explicitly mention otherwise). The only people who will have access to these logs will be the researchers involved in this project (Dr. Gupta, Charles, and I).
2. We will not reveal the name, URL, or any specific traffic and non-traffic related characteristics of your site in any direct form.
3. We will fully acknowledge your help (i.e., "plug" your web site) and generosity in any publication that might result as part of our research.
4. For the privacy conscious, we can provide a sanitizing script to preserve the privacy of your users and resources, if you so desire.
5. We can provide a uiuc.edu FTP site for you to upload your logs.
6. We are willing to abide by additional clauses.
We hope that you are willing to help us in the spirit of extending academic research. We would be exceedingly grateful if you could work with us.
Regards,
Jay Patel
Thing is, it seems pretty legit, and I'm quite inclined to help them by handing over the logs. There are a few weird things about DrunkenBlog in this respect, though:
- I don't really know how 'popular' DrunkenBlog really is, at least in general terms. I've come to learn that a surprising number of people know about it, but they're generally pretty plugged in and not necessarily your average Joe. I.e., the vast majority of regular readers of the site are those who are into things like RSS... and I'm apparently really (in)famous at Apple, but the front page isn't really seen that much.
- The traffic patterns for DrunkenBlog would probably drive a researcher mad, as sometimes I'll go a few weeks without posting, and sometimes there'll be ten posts in one day. Some days it's 150,000 page views and some days it's 1,000. Some days I'm posting on threading and funnels in the OS X kernel, and some days I'm posting about nude students overseas. All part of my plans to lull you into a false sense of security before the next big thing...
- I don't really keep log files, as much fun as it can be to geek out on them... doesn't seem like a good idea in today's society to hold onto them. I generally keep two weeks' worth, and that includes the current week. I even have a script to go through and 'sanitize' IP addresses from comments posted as anonymous after a day or so, because I figure that while some post as anon for convenience and such, others have valid reasons for not wanting things to come back to them. If they aren't going after someone, no reason to keep it.
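For the curious, here is a rough sketch of what a sanitizing pass over an access log could look like. This is not my actual script, nor the researchers' one; the function names, the secret key, and the sample log line are all made up for illustration. It swaps each IPv4 address for a keyed-hash token, so the same visitor still lines up across requests without the real address being kept.

```python
import hashlib
import hmac
import re

# Matches dotted-quad IPv4 addresses; IPv6 is not handled in this sketch.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ip(ip: str, secret: bytes) -> str:
    """Map an IP address to a stable opaque token using HMAC-SHA256."""
    digest = hmac.new(secret, ip.encode("ascii"), hashlib.sha256).hexdigest()
    return "ip-" + digest[:12]


def sanitize_line(line: str, secret: bytes) -> str:
    """Replace every IPv4 address found in a log line with its token."""
    return IPV4_RE.sub(lambda m: pseudonymize_ip(m.group(0), secret), line)


if __name__ == "__main__":
    # Hypothetical key and log line, for illustration only.
    secret = b"hypothetical-secret-discard-after-use"
    sample = '203.0.113.7 - - [27/Feb/2005:00:54:00 -0800] "GET /index.html HTTP/1.1" 200 5120'
    print(sanitize_line(sample, secret))
```

The key should be thrown away once the logs are processed. The IPv4 space is small enough that anyone holding the key could rebuild the mapping by brute force.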
The particular log file he was after was already gone, but I do have a nice big fat one from the last few days, plus the last week, that could be passed on.
I've gone through their paper, and it's pretty interesting stuff. They've worked on a technology called 'Overhaul', with Apache as their test bed (via mod_overhaul), that implements an extension to the HTTP protocol that is almost creepily like a swarm client -- aka BitTorrent -- in order to handle a flash mob of traffic.
The idea is that when a site gets rapidly bombarded by a stampede of client requests, the server slips into 'Overhaul Mode' and breaks the document/file/image into multiple small chunks and distributes those to the clients requesting the file. The clients then connect to each other (via DHTTP) to get the data they need to form a complete document, which offloads much of the file-transferring work from the server and allows it to stay responsive to the requests coming in.
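A toy sketch of that chunk-and-share idea follows. It assumes fixed-size chunks and a round-robin handout, and it fakes the peer-to-peer transfer with an in-memory lookup. It is not the real mod_overhaul or DHTTP wire protocol, and every name in it is hypothetical.

```python
from __future__ import annotations


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut a document into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def assign_chunks(num_clients: int, num_chunks: int) -> dict[int, int]:
    """Server side: hand each client exactly one chunk index, round-robin."""
    return {client: client % num_chunks for client in range(num_clients)}


def collect_document(client: int, assignment: dict[int, int], chunks: list[bytes]) -> bytes:
    """Client side: start with the server's chunk, get the rest from peers."""
    own = assignment[client]
    have = {own: chunks[own]}
    for peer, idx in assignment.items():
        if peer != client and idx not in have:
            # Stands in for a real network transfer from that peer.
            have[idx] = chunks[idx]
    if len(have) != len(chunks):
        raise RuntimeError("no peer holds some of the chunks")
    return b"".join(have[i] for i in range(len(chunks)))


if __name__ == "__main__":
    page = b"<html>hypothetical hyper-popular page</html>"
    chunks = split_into_chunks(page, 8)
    assignment = assign_chunks(num_clients=20, num_chunks=len(chunks))
    assert collect_document(3, assignment, chunks) == page
```

The sketch only works when there are at least as many clients as chunks, which fits the premise: the mode is meant for the moments when a stampede of clients is asking for the same thing.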
Think about it this way. Let's say something on DrunkenBlog gets listed on several high-traffic sites in an overlapping manner. Each client browser connects to the server to download the HTML page, and then renders it. A normal page might have several files -- the HTML of the page, some images, etc. If you have keepalives on, the client stays connected to the server while it snatches those in quick succession -- if you have them off, the client has to reconnect for each one.
Seems smart to automatically keep keepalives on, but consider that you generally need to limit how many instances of Apache are running so that when thousands of people are trying to connect at once, a thousand instances don't get spawned, which would kill the server. If a bunch of people with modems are tying up an Apache process... anyways, gone into all this before.
Long and short, things are either going to get really slow or you'll starve clients out -- the server just won't be able to deal with their requests before the browser decides the server has gone stupid.
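To put rough numbers on that, here is a back-of-the-envelope model. It assumes a fixed pool of worker slots and an average time each client holds one. The figures are invented for illustration and are not measurements from this server.

```python
def max_clients_per_second(worker_slots: int, seconds_per_client: float) -> float:
    """Sustained throughput when every slot is busy: slots / hold time."""
    return worker_slots / seconds_per_client


def seconds_to_drain(queued_clients: int, worker_slots: int, seconds_per_client: float) -> float:
    """How long a burst of waiting clients takes to clear at that throughput."""
    return queued_clients / max_clients_per_second(worker_slots, seconds_per_client)


if __name__ == "__main__":
    slots = 150          # hypothetical cap on server processes
    burst = 10_000       # hypothetical number of clients arriving at once
    for label, hold in (("fast connections", 2.0), ("modem users", 20.0)):
        rate = max_clients_per_second(slots, hold)
        wait = seconds_to_drain(burst, slots, hold)
        print(f"{label}: {rate:.1f} clients/s, {wait:.0f} s to clear the burst")
```

The slower each client is, the longer it sits on a slot, and the wait for everyone else grows in proportion, well past the point where a browser gives up.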
With Overhaul, when this happened the server would kick into a different mode and break up the text of that page that has gone hyper-popular, and whatever files and images that are on it, into small chunks and serve those out. So if 10,000 clients were all requesting the page, they'd each only get a small chunk and then talk amongst themselves in order to get all the chunks they needed for the browser to render the page... all depending upon how overloaded the server was.
If a server would normally start to just die at serving 10,000 clients in a five minute period, this might allow it to stay responsive to much, much more traffic than that: the clients still pull chunks from the server, but they're also pulling chunks from all the other clients connecting...
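A quick bit of arithmetic shows why this could help. It assumes the server sends each client a single chunk and peers supply the rest. The page size and chunk count are hypothetical, and protocol overhead and clients that can't accept connections from peers are ignored.

```python
def server_bytes_normal(clients: int, page_bytes: int) -> int:
    """Bytes the server uploads when it sends every client the whole page."""
    return clients * page_bytes


def server_bytes_chunked(clients: int, page_bytes: int, num_chunks: int) -> int:
    """Bytes the server uploads when each client gets one chunk from it."""
    chunk_bytes = (page_bytes + num_chunks - 1) // num_chunks  # round up
    return clients * chunk_bytes


if __name__ == "__main__":
    clients, page_bytes, num_chunks = 10_000, 100_000, 20  # made-up figures
    normal = server_bytes_normal(clients, page_bytes)
    chunked = server_bytes_chunked(clients, page_bytes, num_chunks)
    print(f"whole pages: {normal:,} bytes, one chunk each: {chunked:,} bytes")
```

In this simplified model the server's share of the upload drops by roughly the number of chunks. The total amount of data moved does not shrink; most of it just shifts onto the clients.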
It's a really neat idea that I'm boiling down in a way that probably doesn't do it justice, but suffice to say I'm fascinated and really inclined to try to help them out.
However, while it's my data, my gut says I should ask my readers if they'd have any problems with it. While I may have the right to do what I want with it, it doesn't mean it's something I should do...
Assuming I got the scripts to sanitize the logs, and they seemed to be legit in filtering out identifying information, would any of you have a problem with it?
Comments (20)
Posted by: ick_Filter at February 27, 2005 12:54 AM
I don't really know how 'popular' DrunkenBlog really is, at least in general terms.
What's with the false modesty? I think you're a blight on the Apple community, but I admit you're known.
Assuming I got the scripts to sanitize the logs, and they seemed to be legit in filtering out identifying information, would any of you have a problem with it?
Why even ask? It's [i]your[/i] data and anyone who wants to keep their information private should know to use a proxy while browsing.
Posted by: Noah Slater at February 27, 2005 01:27 AM
Seems like a very valid cause. No problem here.
Posted by: Maxim at February 27, 2005 01:32 AM
How is this different from the Coral cache thing everyone talks about?
And yes if it is anonymized somehow I don't care
Posted by: Cap'n Hector at February 27, 2005 01:40 AM
Oh, no! Something like this might reveal the fact that I surf your site!
That's not OK with me!
Oh, wait…my comments show that.
Curses. ;-)
Posted by: kreger at February 27, 2005 05:03 AM
cool, a buddy of mine and i were talking about something similar over way too many beers. ours had to do with google cache.
i say go for it. expose us to how many apple employees read your site.
Posted by: Magnes at February 27, 2005 07:24 AM
With Overhaul, when this happened the server would kick into a different mode and break up the text of that page that has gone hyper-popular, and whatever files and images that are on it, into small chunks and serve those out
Hey drunken, wouldn't a system like this solve many of the scaling problems seen with RSS? Or at least diffuse them?
Magnes
Posted by: Kevin Ballard at February 27, 2005 10:52 AM
Sounds interesting. Go for it.
Posted by: Ed Gordon at February 27, 2005 12:05 PM
Seems pretty fair. I think you are even being a little over-cautious (not to a fault), as I can't think of any content or comments that have appeared on your site that approach the level of sensitivity of that for thinksecret or lokitorrent.
Posted by: David Magda at February 27, 2005 12:19 PM
The researcher may also want to ask a webmaster of the Kernel Thread site [1] (Amit Singh): he's been Slashdotted about ten times now (8 occurrences in 2004).
[1] http://kernelthread.com/
Posted by: Ben Donley at February 27, 2005 01:13 PM
Uh, what are you worried about? It's not like you're dissident news. I can't think of a single problem that could come of this, sanitized IP addresses or not.
Posted by: Jason Terhorst at February 27, 2005 01:48 PM
Anything to help a student. I'm a student myself, and I appreciate when people are willing to help with a big project. This one sounds like a semester-long monster work.
Posted by: Brian Schack at February 27, 2005 03:07 PM
I don't know how practical this is. What if you served a full text RSS feed through Coral Cache?
A few possible problems:
- It might take too long to update.
- People use the Coral Cache URL to figure out the uncached URL and then suck up all your bandwidth.
Posted by: drunkenbatman at February 27, 2005 03:07 PM
Uh, what are you worried about? It's not like you're dissident news.
It is about expectations and disclosure. People may not care that the log files are shared (especially if properly sanitized) when they're asked for their thoughts beforehand, but many may have the expectation that the logs wouldn't be shared, and be unpleasantly surprised (and have questions about it) if they were.
This isn't so much about "Yeah, go ahead" as "I don't have a problem, but could you..."
I can't think of a single problem that could come of this, sanitized IP addresses or not.
In terms of someone getting ahold of the info and using it in weird ways, I'd agree, which is why I'm looking at doing it. In terms of my relationship (and trust) with my readers, I can see a lot of problems that could come up, and I value those things.
Posted by: Jay A. Patel at February 27, 2005 03:38 PM
Thanks for posting this. I think it is a great idea to let your readers know about your intentions. It seems that most of your users are OK with this. Well, good for us =)
Anyhow, we're working on fine-tuning Overhaul with the data we've been gathering over the past few weeks. We're also looking into other interesting properties of "hyper-content" and how it can (if it can!) be better delivered using distributed protocols, especially coupled in with the rising popularity of blogs. With that said, I am going to ask for additional help from any readers of this blog: if you can help us out with additional logs (even from a low traffic web site), please do not refrain from contacting me (e-mail address available on my web site).
Some notes for previous commentors:
Coral works. It is one of the better solutions. However, it cannot be a universally accepted solution because of its reliance on others. Coral runs on PlanetLab (which is a research deployment) and can be yanked anytime in the future if the administrators of PlanetLab feel that Coral is too resource consuming (for example, if it becomes more popular).
I did contact Amit from KernelThread. He lost his archived logs (and does not really keep logs, anyway).
We are looking at efficiently delivering RSS feeds.
Posted by: Skatch at February 27, 2005 10:18 PM
Two things:
1) This seems like a very neat idea, and an interesting area of research. Important for keeping those without lots of resources to devote to webhosting (i.e. independent voices) viable on the web.
2) As others have said, go for it.
Good luck Jay.
Posted by: Ben Donley at February 28, 2005 01:16 AM
In terms of my relationship (and trust) with my readers, I can see a lot of problems that could come up, and I value those things.
Fair enough. Lord knows these "flash mobs" are perfectly capable of a PR lynching.
Posted by: Brian Donovan at February 28, 2005 02:00 PM
Go right ahead. Sounds like a really cool idea.
Posted by: Ben at February 28, 2005 05:45 PM
It's your data. Do it if you want to. But I do appreciate your asking.
Posted by: primary0 at March 1, 2005 05:32 AM
give the logs, you will be contributing to a good project :)

No problem here...