Big Systems 101: Spec'ing it Out

This is the first entry in what I hope will become a series covering one of the largest systems I've ever been involved in building. A full year has passed since it went into production, so now I can give a level of detail that wasn't possible before. Yes, this is all written after the fact, which gives insight that wasn't available at the time… oh well.

Alright, the first thing you need with any new system is a set of requirements. Some people want to skip this step and dive directly into the design or even the coding, but hopefully those people will be tied up and left in a closet by themselves somewhere. Just like you can't buy plane tickets without a destination, you can't start system design or implementation without one. So here we go…

Let's build an RSS reader (r1). No, not something to compete with Bloglines, Thunderbird, or whatever. Instead, let's build an RSS reader which can retrieve news from all of the major primary news sources. We're going to skip the New York Times, CNN, and FoxNews and go directly to the Associated Press, LexisNexis, etc. Alright, so although this doesn't say it explicitly, it does imply that we're going to have to deal with huge volumes of information coming in all day and night (r2). Unfortunately, requesting new items every minute might annoy those sources, so let's throttle our requests (r3).

Now for some assumptions… we can assume that articles will have updates throughout their lifetime as new and better information becomes available (a1). There is one upside to using the AP instead of the secondary sources: we don't have to worry about getting the same story from different sources (a2). Let's also assume that every feed can be retrieved by a simple http request (a3), and that since many websites simply syndicate this content, it may have html in it (a4). And of course, we've skipped the biggest assumption of all… that we're actually getting RSS (a5).

So, to summarize, our requirements are:
(r1) An RSS reader importing primary news sources;
(r2) We have to support huge volumes of content being made available constantly;
(r3) We should throttle our requests so as not to annoy the powers that be.
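To make (r3) concrete, here's a minimal sketch of per-feed throttling. The class name, the 15-minute default, and the "one poll per feed per interval" policy are all my own placeholders, not anything the requirements dictate:

```python
import time

class FeedThrottle:
    """Tracks the last fetch time per feed and enforces a minimum interval."""

    def __init__(self, min_interval_seconds=900):  # placeholder: 15 minutes between polls
        self.min_interval = min_interval_seconds
        self.last_fetch = {}  # feed URL -> timestamp of the last allowed request

    def may_fetch(self, feed_url, now=None):
        """Return True (and record the fetch) only if the feed is due for a poll."""
        now = time.time() if now is None else now
        last = self.last_fetch.get(feed_url)
        if last is not None and now - last < self.min_interval:
            return False  # too soon; don't annoy the source
        self.last_fetch[feed_url] = now
        return True
```

The fetch loop would just skip any feed for which `may_fetch` returns False and try again on the next pass.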

And, to summarize, our assumptions are:
(a1) Individual articles can be updated multiple times as corrections/new filings happen;
(a2) Any given item will only appear in one source;
(a3) Any feed should be retrievable via http;
(a4) The content items may have html in them;
(a5) The feeds will conform to RSS.
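Assumption (a4) means we'll need to strip markup out of item bodies at some point. One minimal way to do it, using only the standard library (the function and class names are my own sketch):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content of a fragment, discarding any tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(fragment):
    """Return the text of an item body with any embedded html tags removed."""
    stripper = TagStripper()
    stripper.feed(fragment)
    return "".join(stripper.parts)
```

A real feed pipeline would probably also decode entities and normalize whitespace, but this covers the basic "syndicated content has html in it" case.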

Alright, so I think we're ready to sketch out an initial database design. We need two tables, one to hold the list of feeds and one to hold the content from those feeds. On the Newsfeeds table, we'll start with these fields:

id (primary key)

And for our Newsitems table, we'll start with these:

id (primary key)
feed_id (foreign key)

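The two-table sketch above, expressed as a schema. Only the columns named so far appear; the column types and the use of an in-memory SQLite database are my own placeholders while the design is still this bare:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # placeholder: any relational store would do
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

# Newsfeeds: the list of feeds. Just the primary key for now.
conn.execute("""
    CREATE TABLE Newsfeeds (
        id INTEGER PRIMARY KEY
    )
""")

# Newsitems: the content from those feeds, linked back to its feed.
conn.execute("""
    CREATE TABLE Newsitems (
        id INTEGER PRIMARY KEY,
        feed_id INTEGER NOT NULL REFERENCES Newsfeeds(id)
    )
""")
```

Every item must point at an existing feed; the foreign key makes the database reject orphaned items for us.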
Remember, now that we have a bit of design, we will regularly have to decide whether to update it or dump it and begin again. Depending on how much our two lists change, updating this design could get messy. And that's when it gets fun…