Building a Simple Search Engine

I recently wrote a small search engine to find stock symbols and found the exercise to be challenging and rewarding. I wrote most of the search functionality in PHP which, if you’re new to programming, is a good place to start. It is integrates smoothly with HTML, and it also has a readable, common-sense syntax.

The raw data consisted of three fields: The ticker symbol, the security name and the exchange.

I loaded this into the database with code rather than externally through any admin functions. Why? The fact is I struggled with the interface for a while (GoDaddy) and couldn’t get the data to load properly. Since it was not a big job anyway, I just wrote a small program to do it. For each row of data entered into the database it printed a confirmation line. In fact this served as a good warm-up to make sure I had a handle on database connections and basic MySQL code.

This was actually quite fun! I’d never loaded a large database onto an external server, and found it amazing to see how fast all of the data loaded. All 9000+ symbols loaded in less than 15 seconds.

Once the data was loaded, the question was: How do I get from a raw database full of stock names and symbols into a real search engine?

Well the first thing I had to do was to create an input box to get the user query.

So I started with a super-simple HTML page with a bare-bones user form with the kind of familiar input box used for anything from a Google search to a captcha confirmation. 꽁머니 It takes the user information and calls another page called process_query.php.

Most of the “action” was to take place there.

Now, unlike the homepage which has the familiar html extension, the process_query page has the php extension. This essentially alerts the server (the computer which ‘serves’ the pages to your browser) to expect special php code.

Php sits on the page along with HTML code. Roughly speaking php extends the functionality of html, though of course is a fully functional language in its own right. A page with the php extension will properly display code written entirely in html. However with the php extension you have all of the power of php waiting at your fingertips. All you have to do is invoke to special compiler using the proper opening tag.

This is where all the fun starts!

Once the form from the home page invoked the process_query page, I had to deal with converting the raw values entered by the user into something the server could use. It turns out that this point of interaction between human and server can serve as a launch point for a lot of bugs and security issues. So I included several functions to scrub the values and make them safe and palatable to the server.

But now came the hard part.

I had to query the MySQL database for a whatever the user entered into the input box. I had no MySQL experience and found relational database discussions arcane and incomprehensible. So simplicity was going to be key. I took a deep breath, squared my shoulders, and focused on just querying the database for a single search term. Obviously not my ultimate goal, but, you gotta walk before you can run, as they say.

Honestly I figured it would take me a day or two of internet searching and code trials just to do this but, amazingly: It took about 10 minutes!

That’s right. Ten minutes.

Why?

It turns out that querying the database for a single field is the MySQL equivalent of a child learning to say “Da Da”.

Within a half hour I had expanded it to a two element search and was happily searching and retrieving stock symbols out of the database.

This worked well at getting a list of somewhat screened results but it vastly increased the number of results (since I used the OR operator) and put all the more pressure on the next phase of the operation: sorting the results according to relevance.

Now I can’t speak for the trials and tribulations of large search engine builders, but at the relatively small scale of a stock ticker search, relevance sortation is a much bigger task than simply pulling the items out of the database.

What makes it so hard?

First of all you have the relevance level depending on the positioning of a given character group within a word. If a user types “sil” we would naturally assume the user was looking for a word which began with those letters. It is more likely the user was looking for “silver” or “silicon” than “fossil” or “intersil”.

So we need to increase the relevance score for matches in proportion to their proximity to the front of the word.

But we also have to consider the importance of a given word match within the context of the whole security name. For example, if the user types in “gold” we want “Gold Resource Corp.” to rank higher than “First Eagle SoGen Gold Fund”. The assumptions are (a) that the user will type in the most important items first and (b) the user will attempt to put words in the order they are to be found in the security name.

Yet we also have to consider the possibility of misspelled words. Fortunately, for this there is a handy built-in function called levenshtein. Basically this computes the number of steps it takes to go from one text to another. Say you type in “brikc” instead of “brick”. The levenshtein distance would be 1 since you merely have to switch the c and k to get the target text. However if you typed in “ibkcr” this would require several more steps to obtain the target text and so would have a considerably larger levenstein distance.

But all of this has to be put together in — yes I have to say it — an algorithm which processes the results quickly and painlessly.

In the interests speed, I actually did something which might surprise some readers. I decided to chop off any characters which exceeded a word length of 7.

Why?

The answer is that all of these functions increase processing time at an exponential rate. Limiting word length is essentially reducing the base number to which the exponent is applied. This can be a huge issue when the server encounters large query results. Suppose the user is looking for trust related securities and types in the abbreviation “tr”. This will obviously pull up a large number of results since not only are there many securities with the specific abbreviation “tr” but there are many words, like “tree” and “treasury”, which contain that combination of characters.

Just limiting all words to a maximum of 7 creates a huge savings in terms of processing.

Well, unfortunately I don’t want to divulge all of the secrets to writing the algorithm, but I will say that it took several days of experimentation and brainstorming to come up with a workable solution. In the end I came up with a pretty straightforward system that actually surprised me with its simplicity. Frankly it isn’t perfect, but it does a decent job and works very quickly, chomping through huge result sets with ease.

Leave a Reply

Your email address will not be published. Required fields are marked *