OK, that means breaking it down into smaller chunks. But how to distribute the data more or less equally? That was an easy one: since Feederator adds roughly the same number of new feed items to the archive every day, I would separate the chunks by day. Every item already has a publishedDate, which marks the date it entered the datastore. But that would create another problem: how to submit all the different sitemaps to search engines like Google or Bing?
Turns out there’s a solution for exactly that: a sitemap of sitemaps. The super sitemap file has to look as described here. So this file would contain one entry per day and could be generated relatively easily by iterating over all days in the last three months (the length of time items are kept in the archive). I wouldn’t even have to know how many items are available per day at this stage. Just assume there are some; that’s enough.
The index file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
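The day-by-day iteration described above could be sketched in Java roughly as follows. This is only a sketch under assumptions: the `BASE_URL` pattern and the class and method names are hypothetical and not taken from Feederator, and it uses `java.time` (Java 8+) for the date arithmetic. The `sitemapindex`, `sitemap` and `loc` elements come from the sitemaps.org protocol.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class SitemapIndexSketch {
    // Hypothetical URL pattern; the real servlet path is not given in the post.
    static final String BASE_URL = "https://example.com/sitemap?date=";

    /** Returns every date from three months before 'today' up to and including 'today'. */
    static List<LocalDate> archiveDays(LocalDate today) {
        List<LocalDate> days = new ArrayList<>();
        for (LocalDate d = today.minusMonths(3); !d.isAfter(today); d = d.plusDays(1)) {
            days.add(d);
        }
        return days;
    }

    /** Builds the sitemap index with one <sitemap> entry per day, without counting items. */
    static String buildIndex(LocalDate today) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (LocalDate day : archiveDays(today)) {
            // LocalDate.toString() yields the ISO form, e.g. 2010-12-19
            xml.append("  <sitemap><loc>").append(BASE_URL).append(day).append("</loc></sitemap>\n");
        }
        xml.append("</sitemapindex>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildIndex(LocalDate.now()));
    }
}
```

Note that the index never queries the datastore: it simply assumes each day in the window has some items, as described above.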
In my case, the same servlet that creates the sitemapindex file also creates the different per-date sitemap files. To avoid huge quantities of XML overhead, you can also export the per-date sitemaps in a plain-text, line-by-line format. The sitemap file for 2010-12-19 looks like this:
It contains one link per line. Writing a servlet like that is pretty easy.
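As a rough illustration of the line-by-line format, here is a minimal Java sketch that turns a list of item URLs into a plain-text sitemap. The class name, the method name and the example URLs are hypothetical; in the real servlet the URLs would come from a datastore query on publishedDate, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;

public class TextSitemapSketch {
    /** Builds a plain-text sitemap: one absolute URL per line, nothing else. */
    static String buildTextSitemap(List<String> itemUrls) {
        StringBuilder body = new StringBuilder();
        for (String url : itemUrls) {
            body.append(url).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Hypothetical item URLs standing in for one day's archive entries.
        List<String> urls = Arrays.asList(
                "https://example.com/item/1",
                "https://example.com/item/2");
        System.out.print(buildTextSitemap(urls));
    }
}
```

Inside a servlet, the returned string would be written to the response writer with a `text/plain` content type and UTF-8 encoding, since the sitemap protocol requires text sitemaps to be UTF-8.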