Exporting your App Engine Data with a Sitemap of Sitemaps

Currently there are about 200,000 archived items in the Feederator database. Hoping to attract more traffic, I wanted to make this data available to search engines. But how could I do that without putting too much of a burden on my App Engine quota? One big sitemap certainly wouldn’t work: the search engine’s request would hit the 30-second limit on a servlet call long before the links to all the items could be generated.
OK, that means breaking it down into smaller chunks. But how to distribute the data more or less evenly? That was an easy one: since Feederator adds roughly the same number of new feed items to the archive every day, I would split the chunks by day. Every item already has a publishedDate, which marks the date it entered the datastore. But that raised another problem: how to submit all those separate sitemaps to search engines like Google or Bing?
It turns out there’s a solution for exactly that: a sitemap of sitemaps. The super sitemap file has to look as described here. This file contains one entry per day and can be generated relatively easily by iterating over all days in the last three months (the length of time items are kept in the archive). I don’t even need to know how many items exist for each day at this stage. I just assume there are some, and that’s enough.
The result looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.feederator.org/sitemap/2010-12-19</loc>
</sitemap>
<sitemap>
<loc>http://www.feederator.org/sitemap/2010-12-20</loc>
</sitemap>
<sitemap>
<loc>http://www.feederator.org/sitemap/2010-12-21</loc>
</sitemap>

</sitemapindex>
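
Generating such an index could look roughly like the following. This is only a minimal sketch, not the actual Feederator code: the class name SitemapIndexSketch, the method buildIndex and the BASE_URL constant are made up for illustration, and it uses java.time from current Java versions rather than whatever the original servlet used. In a real servlet the returned string would be written to the response.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class SitemapIndexSketch {
    // Assumed base URL of the per-day sitemaps, taken from the example above.
    static final String BASE_URL = "http://www.feederator.org/sitemap/";

    /** Builds a sitemap index with one <sitemap> entry per day, oldest day first. */
    static String buildIndex(LocalDate today, int monthsBack) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        // No datastore access here: we simply assume every day has some items.
        for (LocalDate day = today.minusMonths(monthsBack); !day.isAfter(today); day = day.plusDays(1)) {
            xml.append("<sitemap>\n<loc>")
               .append(BASE_URL)
               .append(day.format(DateTimeFormatter.ISO_LOCAL_DATE)) // e.g. 2010-12-19
               .append("</loc>\n</sitemap>\n");
        }
        xml.append("</sitemapindex>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Three months matches how long items stay in the archive.
        System.out.print(buildIndex(LocalDate.now(), 3));
    }
}
```

Because the loop only does date arithmetic, the index costs no datastore quota at all, no matter how many items are archived.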

In my case, the same servlet that creates the sitemap index file also creates the individual per-date sitemap files. To avoid a lot of XML overhead, you can also export the per-date sitemaps as plain text. The sitemap file for 2010-12-19 looks like this:

http://www.feederator.org/item/agpmZWVkZXJhdG9ychYLEg1GZWVkSXRlbU1vZGVsGLfUvwIM
http://www.feederator.org/item/agpmZWVkZXJhdG9ychYLEg1GZWVkSXRlbU1vZGVsGLjUvwIM
http://www.feederator.org/item/agpmZWVkZXJhdG9ychYLEg1GZWVkSXRlbU1vZGVsGLvUvwIM
http://www.feederator.org/item/agpmZWVkZXJhdG9ychYLEg1GZWVkSXRlbU1vZGVsGMHUvwIM
http://www.feederator.org/item/agpmZWVkZXJhdG9ychYLEg1GZWVkSXRlbU1vZGVsGNbUvwIM
http://www.feederator.org/item/agpmZWVkZXJhdG9ychYLEg1GZWVkSXRlbU1vZGVsGOTUvwIM

Each line holds exactly one link. Writing a servlet like that is pretty easy.
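
To give an idea of the two pieces such a servlet needs, here is a rough sketch under a few assumptions. The names DailySitemapSketch, dayFromPath, buildTextSitemap and ITEM_URL are hypothetical, and the list of encoded keys stands in for the result of a datastore query on publishedDate for the requested day, which is left out so the example runs on its own. A real servlet would call dayFromPath on the request path, serve the index when it returns null, and otherwise write the text sitemap with a text/plain content type.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

class DailySitemapSketch {
    // Assumed URL prefix for a single archived item, taken from the example above.
    static final String ITEM_URL = "http://www.feederator.org/item/";

    /** Returns the day encoded in a path like "/sitemap/2010-12-19", or null if there is none. */
    static LocalDate dayFromPath(String path) {
        if (path == null) {
            return null;
        }
        String last = path.substring(path.lastIndexOf('/') + 1);
        try {
            return LocalDate.parse(last); // expects ISO format, yyyy-MM-dd
        } catch (DateTimeParseException e) {
            return null; // no valid date: the caller would serve the index instead
        }
    }

    /** Builds a plain-text sitemap with one item URL per line. */
    static String buildTextSitemap(List<String> encodedKeys) {
        StringBuilder out = new StringBuilder();
        for (String key : encodedKeys) {
            out.append(ITEM_URL).append(key).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the keys a datastore query for one day would return.
        List<String> keys = Arrays.asList(
                "agpmZWVkZXJhdG9ychYLEg1GZWVkSXRlbU1vZGVsGLfUvwIM",
                "agpmZWVkZXJhdG9ychYLEg1GZWVkSXRlbU1vZGVsGLjUvwIM");
        System.out.println("Requested day: " + dayFromPath("/sitemap/2010-12-19"));
        System.out.print(buildTextSitemap(keys));
    }
}
```

Since each per-date sitemap only covers one day of items, every request stays small and should finish well within the servlet time limit.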

Posted by Daniel Eichhorn

