Speeding up data store queries with self-merge joins

Do you like this post? Writing posts, developing libraries and programming online font creators is actually hard work. I order most of my electronics from a Chinese website called Banggood. It is save and the items usually arrive in good quality. With every order you do there by following this link you are supporting the blog. Thank you!

For a while now Feederator allows some kind of full text queries. When items are read for the first time into the datastore an index will be created. This index is currently a break down of the single words in the title, but can be extended to the description and other attributes of an item. My first attempt of implementing the full text search on the AppEngine resulted in a very costly algorithm.
The fulltext search index is stored as a set of strings. The datastore allows you to query for items on a set as if it would be a singular attribute, like a single String for example. So you can create a query for a set of strings index={“green”, “yellow”, “red”} in this way WHERE index == “red” which would be a hit in this case. I tried to do a query like this WHERE index == “red” AND index == “green” order by date which returned with errors telling me that I need to create a datastore index for that query. I thought that the error came from the fact that I had to filters on the attribute index. But I was wrong. The problem was that I wanted to sort the results at the same time. So, by forgetting about the in-database-sorting the query works and it works darn fast. And even better: you don’t even need an index for AND connected filters like that! Which means you can save storage space (in my case almost 1GB!).

Here are some rules of thumb that I learned from this lesson. Due to the way the datastore is organized (don’t forget its BigTable nature!) you have to pay attention when designing your data model.

  1. You can query easily on sets
  2. Try to avoid non-equal filters or filters that use OR. The best filters are ones that try to find an intersection between different domains on ONE entity
  3. Sort your results in-memory to avoid the need for indexes
  4. Sets are expensive to serialize/ deserialize. This costs occure when you write or read a set from datastore. To avoid serialization costs put your full text index on a separate child object. You query on the child object only for the key. With the key you get the parent object and then you can do in-memory sorting/ ranking
These links helped me a lot to understand how to optimize my full text search:
Posted by squix78

Leave a Reply