Regarding the full text feeds

I mentioned a while ago I was working on full-text feeds for the blog, or rather evaluating them, as when it comes to 'features' for DrunkenBlog the two most highly requested are full feeds and/or a mobile feed. On average, I get about 5 emails a week asking for full feeds, and they're starting to pile up... so I wanted to give an update on what's going on behind the scenes.

If you haven't followed RSS at all, this won't mean a lot to you. I have been listening and working on full feeds, but they've been somewhat on the back burner while I work on something really big for the blog as well as more chats. Between people donating and the number of people asking, they got pushed closer to the front of the stove.

There are real reasons to want full-text feeds. At the moment, aggregators primarily just pull the feeds down and display them in a central place. Next up we'll have searching the feeds for the info you want, and I hope to God someone is working on a Bayesian system (similar to how mail clients recognize spam) that gives you a better idea of what is going to be up your alley, as it's really easy to pull in more feeds than you can reasonably get to, and at that point you need something running triage for you.

In order for these things to be really effective, they need as much information as possible, and a title (especially when they're as random as mine) along with a short excerpt hurts their effectiveness. If your RSS reader supports caching, and you have full feeds, you can suck down your feeds on your laptop and read through what you want while you're offline.

If you are a mobile user, it's easier to fit just the content on your screen rather than the whole site's chrome, advertising, etc. Even if you're not offline while reading, you still save clicks by not having to open browser windows, no matter which client you're using, even if it's a web-based one. Sometimes it can be hard to tell if a story is really up your alley from the excerpt until you pull up the full story.

I do recognize the benefits of full-feeds, and like I said I've been listening. Unfortunately I have hit some issues... primarily related to bandwidth, load and formatting.

Load and bandwidth

To give you an idea, once you factor out the spider-bot noise (like Google indexing the site, and I'm being overzealous in what I cut out in order to lowball the figures) there are a hair over 5,200 unique IP addresses pulling down the DrunkenBlog RSS feed every week. This does not mean there are 5,200 people pulling down the feed, as someone might pull down the feed on their laptop at work, with one IP address, then again at home with another. They may have a broadband connection whose IP address changes every once in a while, or never, or a modem whose address changes every time they dial in.

It starts getting weirder when you talk about people using services like Bloglines, where Bloglines gets the feed and then others view it through their service. According to Bloglines, there are 110+ subscribers on their service, but they only show up as a few IPs. And there are a lot like that, although the vast majority of mine are using aggregators on OSX, Linux or Windows. And then not everyone checks their feeds every week, and we're not even going to talk about the people not even using RSS yet... let alone the people who subscribe to any feed that comes their way but aren't really loyal readers.

Without spending time on more accurate tracking, we're talking about the equivalent of exit polling. I.E., within the realm of possibility but don't bet the farm on it. If I had to guess about actual individuals pulling down the feed, I'd want to undershoot and shave off roughly 25% from the weekly 5,200 and assume 4,000 unique people pulling down the feed. There's a reason why that's good to keep in mind, but the actual number of requests is important too.

Now, remember, these are people pulling down the feed through a reader (not just clicking on it from their browser or other noise), not page views... we're just talking about people pulling down the RSS feed, and a lot of people haven't gotten into it yet, although a lot of my readers have. We're not talking earth-shattering numbers here, but they're notable for such an odd and relatively young site.

The current RSS feed (which includes the link, title and excerpt) averages 10 to 11 kilobytes in size. If that was pulled down every request, that'd be ~210-230 megabytes of bandwidth each month... just for the normal feed.
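As a rough check on that figure, here is a minimal sketch of the arithmetic. It assumes (the post doesn't spell this out) that each of the ~5,200 unique IPs pulls the whole feed once a week and that a month is four weeks; the names are made up for illustration.

```python
# Back-of-the-envelope monthly bandwidth for the excerpt feed.
# ASSUMPTION: one full download per unique IP per week, four weeks
# per month. Real aggregators poll far more often than that, but most
# of those polls should come back "not modified" and cost almost nothing.

UNIQUE_IPS_PER_WEEK = 5_200   # figure from the post
WEEKS_PER_MONTH = 4           # assumed


def monthly_megabytes(feed_kb: float,
                      pulls_per_week: int = UNIQUE_IPS_PER_WEEK,
                      weeks: int = WEEKS_PER_MONTH) -> float:
    """Return monthly transfer in megabytes (1 MB = 1,000 KB here)."""
    return feed_kb * pulls_per_week * weeks / 1_000


low = monthly_megabytes(10)    # 10 KB feed
high = monthly_megabytes(11)   # 11 KB feed
print(f"{low:.0f}-{high:.0f} MB per month")
```

Under those assumptions the result lands in the same ballpark as the ~210-230 megabytes above. Plugging 430 in for the feed size gives a feel for why the full feed is a different animal.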

Luckily, there are some saving graces there. Many others' feeds are larger, but I keep mine to the last 15 entries and I've resisted just posting everything I read that I find interesting until I have the time to set up a proper link list or something, as that brings in other issues.

Something you would have gotten from the RSS for Mac OS X Roundtable was HTTP 304 (Not Modified) status codes, which tell the reader that the feed has not been modified, and if the reader supports it, the feed doesn't get downloaded again. These requests involve little server overhead to deal with, and much bandwidth gets saved because I don't just link to everything and anything I find interesting at the moment.
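For anyone curious how the 304 dance works, here is a simplified sketch of the decision a server makes. This is not how DrunkenBlog or Apache actually implements it; `conditional_status` is a made-up function, and it only does an exact ETag match (real servers also handle lists of tags and weak validators).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(request_headers: dict, etag: str,
                       last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is still current, else 200.

    Hypothetical helper for illustration: `etag` and `last_modified`
    describe the feed as it currently sits on the server.
    """
    client_etag = request_headers.get("If-None-Match")
    if client_etag is not None:
        # When If-None-Match is present it wins over If-Modified-Since.
        return 304 if client_etag == etag else 200

    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            client_time = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            return 200  # unparseable date: just send the feed
        if client_time.tzinfo is None:
            client_time = client_time.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= client_time:
            return 304
    return 200


feed_changed = datetime(2004, 11, 10, 8, 24, tzinfo=timezone.utc)
headers = {"If-Modified-Since": format_datetime(feed_changed, usegmt=True)}
print(conditional_status(headers, '"abc"', feed_changed))
```

A reader that sends neither header gets the whole feed every time, which is why reader support matters so much here.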

As it stands, because the big booms of traffic for DrunkenBlog are so sporadic, RSS-related bandwidth is approximately 7-10% of the whole on a normal month once everything is averaged out.

However, things start getting a little more interesting when you look at the full-text feeds I've been implementing. As an example, a very popular format for RSS feeds would be full-text + comments. This is really, really nice, as it lets you not only read the full text of the post, but also read the comments posted by others when they're added, etc.

The RSS v2.0 full-text+comments feed for the last 15 entries (before this post ;) is ~430 kilobytes. Now, this is well over what other sites would see, because a lot of my posts are very, very large compared to other sites, and some of them get a lot of comments when readers are particularly interested in the subject.

I recognize DrunkenBlog is an aberration of sorts, as its feed includes things like my own massive posts along with the Growl Chat and RSS Roundtable, not little blurbs and news from the day. And remember, while this file size could easily go down, it could easily get much, much larger the next time I piss off a large group of users. While I was testing it around the Convergence Kills timeframe, I was seeing 650k sizes. 5,000 people downloading 650k just one time is 3.25 gigs.

There are posts on DrunkenBlog with one comment, posts with 50 comments, and posts with 125 comments... and many of those comments are huge. I'm mentioning this because while the current feed is a decent picture of the average for the blog, longtime readers will know it could be much, much larger very easily, and because I don't throw in a lot of 'filler', the really big things can run together.

Dropping the comments on the feed for the last 15 entries, and just tacking on the count of them at the site at the end of the feed helps, and just requires someone to click and open their browser to see others' comments. Not as convenient, but it shaves the current full feed from 430k to 325k. That's a nice 25% savings, but still a factor of 32 over the current 10k feed.

When you want to cut down on bandwidth, this is normally where something like gzip compression through Apache comes into play, and I've been testing that quite a bit. It knocks the size down from 430k to around 130k, or a savings of 70%. This is a solid reduction in size, even though it's still a factor of 13 over the current feed, and bandwidth is still a real problem, or rather more than I'm comfortable with.

Unfortunately, compression brings in its own problems, namely load on the server and speed. DrunkenBlog usually stays pretty fast for getting your feeds and viewing through some major traffic... slashdottings, you name it, sometimes all at once. I go into the how and why of this in a prior post, but much of it involves allowing headroom and keeping average load as low as possible, and trying to keep things as low-rent as possible so everything can scale without dying.

Compression is nice, as the idea of sucking down a 400-500k feed can seem a little wiggy for many, so if you can get it down to 150-200k, that's much easier to swallow. However, the difference between sucking that down a high speed connection versus a modem is pretty drastic: a few seconds versus a few minutes.
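To put rough numbers on that gap, here is a sketch of the raw transfer time for a given feed size. The link speeds are assumptions (a nominal 56k modem and a 1.5 Mbit DSL line), and it ignores latency, protocol overhead and the fact that real modems rarely hit their nominal speed, so treat the output as a best case.

```python
def transfer_seconds(size_kb: float, link_kbps: float) -> float:
    """Seconds to move `size_kb` kilobytes over a `link_kbps` kilobit/s link."""
    return size_kb * 8 / link_kbps


# ASSUMED nominal link speeds, in kilobits per second.
LINKS = {"modem": 56, "dsl": 1_500}

# Feed sizes from the post: uncompressed and gzipped full feed.
for label, size_kb in (("uncompressed", 430), ("gzipped", 130)):
    for link, kbps in LINKS.items():
        secs = transfer_seconds(size_kb, kbps)
        print(f"{label:>12} over {link:<5}: {secs:6.1f} s")
```

Even in this idealized version, compression matters a lot more on the modem line than on the DSL one.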

I don't have hard statistics on readers' connection speeds, but I can capture data such as how long it takes most people to connect and get what they want. I.E., if I know it takes a DSL connection a certain amount of time to download a page and disconnect, and 70% of them are beating that time, I can be reasonably sure 70% of the people accessing the blog are geeky enough to have high-speed connections of some form. (As a side note, the majority of those who aren't on broadband are overseas.)

When someone is on a modem, compression speeds things up in a big way. You take a hit in processor cycles on the server side and the client side while it's compressed and decompressed, but their connection is so constrained that overall it's faster. The case isn't nearly as clear when it comes to broadband connections, and in some cases it actually makes things slower even though bandwidth is saved.

With a feed that is so much larger, the combination of compression along with the existing HTTP 304 status codes when the feed hasn't changed seems like an obvious winner. However, even excluding the speed factor for compression, there is the load factor. This has been really difficult to quantify, and could use more testing. But with the testing I have done, and extrapolating out from the existing stats of how the feeds work, I'm not happy with what I'm seeing.

I'm going to keep working on it as I can, but I don't think I'm going to be able to offer full-text feeds at the moment to those who haven't donated to ease the load.

And just to make it clear, I do not have my hand out here, and would love to slap up full feeds as I love my readers. But remember, I bring in approximately zero from the site. Yes, there are some Google ads, but I moved those way out of the way because it annoyed me that people had to see them before they saw the chat. They're so far below the fold that on a normal-sized post, most don't ever see them. And it doesn't help that my readers are often geeky enough that they mentally tune out ads or have them blocked altogether...

In checking the last stats, the Google ads made around $1.43 over the last month. When you have an idea of how much traffic DrunkenBlog pulls in, even while sporadic, that might seem a little weird. However, some categories of ads pay better than others, and some topics the ads are geared towards pay better... and I specifically avoid giving in to that, and might just remove them altogether, as in order to make them work better I'd have to make them more intrusive, and I moved them out of the way for a reason.

I do love that DrunkenBlog has a lot of readers; I value them, and it's really neat to get an email from the Growl project letting me know that prior to the chat they'd had 2,500 total downloads for the software, and within a week of the chat they blew past 7,500. That's beyond cool, and one of the reasons DrunkenBlog is worthwhile to me. I'm certainly not doing it for the fame, otherwise well, I'd use my real name.

However, in order to keep it going while trying to do cool things like the chats, and some other things I'm trying to do for the near future, I need to keep it as low-rent as possible so as little as possible is actually coming out of my pocket. I'll honestly keep evaluating it, but the increase in load and bandwidth is just more than I'd be comfortable with on my dime.

Formatting and CSS

This one is turning into one hell of a headache, and basically boils down to anything but plain text looking a little messed up in the full feed, which I've handed out to donors.

For example, this is what the Growl chat looks like on the site under most browsers:

[Image: normal format]

And this is what it looks like when viewed within the full feeds:

[Image: messy format]

The obvious problem is that the Growl icon isn't over to the right with the text flowing around it. Here is what you see when you look farther down the full feed of the Growl chat:

[Image: really messy]

Again, the images aren't where they're supposed to be, and look pretty nasty. You'll also notice that the 'question' isn't colored, which it should be.

But then the Deja Drunken post looks just fine:

[Image: really messy]

The image in the Deja Drunken post is literally just slapped in, and displays just fine. But the other images are positioned via CSS, so you're able to put it where you want and get the text to flow around it. The color is also defined via CSS.

With something like RSS 2.0, you're able to tell the aggregator "Take this stream of data and interpret it" which works great and you're able to get HTML and other spiffy things. And they do display CSS just fine, but only if you include the CSS inline with the content itself. I.E., when adding the code for the image, you put the CSS behaviors you want for it with it, such as floating to the right with a margin, and those should work just fine.
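As a sketch of what "putting the CSS with it" could look like if it were done automatically when the feed is generated, here is a small post-processing step that swaps known classes for inline styles. Everything in it is hypothetical: the class names, the style values and the `inline_classes` function are made up, it isn't what the blog software does, and a regex like this will trip on anything beyond simple double-quoted class attributes (for instance, an element that already has its own style attribute).

```python
import re

# HYPOTHETICAL class-to-style map; in practice this would have to mirror
# whatever rules live in the site's external stylesheet.
INLINE_STYLES = {
    "float-right": "float: right; margin: 0 0 10px 10px;",
    "question": "color: #336699;",
}

CLASS_ATTR = re.compile(r'class="([^"]+)"')


def inline_classes(html: str, styles: dict = INLINE_STYLES) -> str:
    """Replace class attributes with style attributes for known classes."""
    def swap(match: re.Match) -> str:
        declarations = [styles[name]
                        for name in match.group(1).split()
                        if name in styles]
        if not declarations:
            return match.group(0)  # unknown classes are left untouched
        return 'style="{}"'.format(" ".join(declarations))

    return CLASS_ATTR.sub(swap, html)


print(inline_classes('<img class="float-right" src="growl.png">'))
```

The site itself would keep using plain classes; only the copy that goes into the feed gets expanded. The obvious downside is that the map has to be kept in step with the stylesheet by hand.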

Unfortunately, normally you just specify a class for the image, and import an external CSS file... this is about what everyone does, and one of the most important aspects of CSS. By just specifying a class, I'm able to change that one file and change the behavior of those images and the color of the questions or well, anything without having to go through the whole site manually. It's um, kind of the whole point.

You could say, "Hey, the images aren't actually necessary, they're just nice... and you could just make the questions be in italic so someone reading the full feed can tell they're a question so it doesn't all blur together quite so much", and there's something to that.

However, if I was going to make the questions be in italics, I'd specify that in the CSS, as that's the whole freak'ing point of using CSS for it in the first place. Yes, I could wrap the class in the HTML tags for italics, but it's just so sub-optimal that there has to be a better way. If I add an image, I can just wrap it with the CSS info, but again, one of the major points of CSS is that the browser can just load that external file once and cache it, let alone saving the author time.

There has to be a better way, I just can't seem to find it... I have found a way to specify a CSS file or another type of file which you can link to, but these are more about formatting the XML data in the RDF feed itself, not the content the feed is holding. I.E., I can't seem to find any way to tell aggregators "Use this external CSS file for displaying these classes".

I can't be the only one hitting this, so I'm probably just missing something. If you have any ideas, please feel free to fill me in.

Posted by drunkenbatman
    November 10, 2004, at 08:24 AM

