Development

All about searching

Search is a big part of Discover, but it’s more complex than you might think. There are many ways of implementing search systems, so lets have a look at what’s involved.

Exact matches

Like its name suggests, a search using exact matches will only find things that match the search term exactly, so if you have a record containing “hello world“, it will only be found by your search if you search for “hello world“; searching for “hello“, “world” (or even “HELLO WORLD” if the search is case-sensitive) will not match it. This kind of search can be very fast, but it rapidly becomes less useful the longer your search term and search targets are, as the slightest difference will result in not matching. However, it remains very useful for things like searching for tags, where the search terms and targets are both very small and thus much more likely to match in useful ways.

“Contains” matches

Searching for fragments of text in a larger target are very useful. Using our previous example, “llo wo” will match our record. This kind of search is quite inefficient, but it remains a perfectly usable basis for searching collections of several thousands of records simply because database servers are pretty good at it, however, it’s not scalable and you will eventually hit performance limits.

Indexed matches

As a strategy to head off the performance problems inherent in “contains” searches, you can give the database a head start by building indexes built from words or phrases within the data. Say we have these records:

  1. ballet dancer
  2. tap dancer
  3. kitchen tap

From these we can pre-build an index that might look like this

  • ballet: 1
  • dancer: 1, 2
  • tap: 2, 3
  • kitchen: 3

now if we search for “dancer“, we can first look in the index, which tells us that this word occurs in records 1 and 2 – but notice that we got to that result without having to look at the contents of those records at all, only the index. If we search for “ballet dancer“, we can look up each word in the index separately, and then extract the results that contain the same index values. This could also match words in a different order, or where there is perhaps intervening punctuation. This is great news for performance, but adds some complexity as we need to make sure that the index stays up to date when we change our data, and it needs additional storage space and memory. This approach is also sometimes called “full-text indexing”.

This approach also opens up the possibility of including partial matches in our results, giving us a rating of how good a match is – so for our “ballet dancer” search we would have a score of 2 for the first record (as it matches both words we searched for), but the “tap dancer” record would get a partial match with a score of 1, so we could perhaps list that lower-scoring match after the better matches – this is called ranking.

Fundamentally though, we are still doing exact matches, just in a different place, and so it has some limitations because of that.

Stemming

While our indexed search finds “dancer“, it would fail to find “dancers“, “dance“, “dances“, or “dancing“, which are likely to be of interest too. This can be improved by processing the search terms before we look them up. Stemming is a means of “normalising” search terms so that they match more things that are likely to be of interest, but it’s often quite language-dependent. We can use a dictionary of common words and grammar rules to help us map all the likely variants of a word into a standardised base term – so we could have a map like:

dance, danced, dancer, dances, dancing: dance

and then have an index that points at all the records containing “dance“, and a search would then result in matching any of those variants even though our index has not got any bigger.

We can do other operations at this point too, in particular spell check. We had a good example of this in Discover where text extracted from an image had found “Fenguin”, and yet searching for “Penguin” would find that record. Note that this can apply to both the search terms and the index that is created, catching spelling mistakes in search terms and stored records alike.

Synonyms

You can use a similar approach to stemming to find words similar in meaning even if they are completely different lexicographically – my thesaurus suggests “caper, cavort, frisk, frolic, skip, prance, romp, gambol, jig, bound, leap, jump, spring, bob, hop, trip, bounce” as alternatives to “dance”, and we could map all of those into the same index to make matches more likely.

You might have noticed that all of these strategies are aimed at increasing the numbers of matched records, and that can get counterproductive in much the same way as overly specific searches can.

Proximity

We mentioned scoring results so that we could rank them, but the mere presence of matching terms is not the only thing that can be taken into account when ranking. Proximity to other words and ordering can also contribute to the score, so while we might give “ballet dancer” top marks, we might want “dancer ballet” to receive a lower score, or a sentence in which the words are further apart to be considered a match, but with a lower score. For example “ballet is performed by dancers” contains both the words we’re looking for, but not next to each other.

Logical operators

A common addition to search systems is to allow you to merge the results of multiple searches into one result set, applying logical rules as you do so. That’s mostly a fancy name for “and“, “or“, and “not“. By default, most searching systems default to “and”, so the more search terms you add, the smaller the result set, as it becomes more and more specific. We have seen how this can work in our indexed search examples, but it’s not clear how we could search for both ballet and tap dancing because none of the records match all three words at once. A logical search would be something like:

(ballet OR tap) AND dancer

That would search for records that contained “ballet” or “tap“, then find all the records that contained “dancer“, and finally show us only the records that appear in both sets. That syntax with explicit AND and OR operators and brackets is cumbersome and unnatural, so it’s common to use a “natural language” preprocessor that maps common ways of phrasing things into those stricter terms internally.

Document searching

So far we have been looking at processes that take input from a person entering short search terms by hand, and delivering a set of results, which are typically documents of some kind. That’s useful, but there are other kinds of searches that don’t work that way, in particular “more like this”. In this scenario, using our conventional search inputs, we might one particular thing of interest, but what we’d like to find now is more things that are very similar to this result. That’s a very different approach – it’s effectively making our search terms enormous, and doesn’t fit very well into the approaches we have seen so far, so there are systems that are dedicated to facilitating this kind of matching.

Image searching

Of course you might want to search for things that are not text, in particular images, and “more like this” for images be seen working very well on Google Images and TinEye. Image search works in a very different way to text searching, and I won’t go into it here, but if you want to find out more, the term to look for is “perceptual hashing”! Ultimately though, an image search should be able to produce a ranked list of records that match our input search image, so the two systems can easily work together.

So what’s a search engine?

Looking back at the start, most general-purpose database systems in common use (e.g. MySQL) allow basic searching, providing exact and partial matches, and sometimes full-text indexing with basic ranking, but that’s generally where they stop. If we want to take on additional measures including stemming, synonyms, logical operators, natural language preprocessors, document searching and so on, we need to take a step up to systems specifically designed for this kind of search, and these are what we call search engines.

Search in Discover

Searching in Discover uses two kind of searches: simple exact match searching for tags, and a comprehensive search engine for full text searching. Discover’s database is built with the aforementioned MySQL, and is used directly for tag searches, but rather than building or hosting our own search engine, we use an external service called Algolia, which is straightforward (if you’re a developer!) to tack onto the back-end of a web application to add much more comprehensive search capabilities.

Search results from DCD Discover

Leave a Reply

Your email address will not be published.