In 2020 Sep. I implemented a simple search system from scrach. The repo is at Here. The result is something like this:

Of course, there is a story behind..

In 2020 Aug, I got interview opportunity of Apple Machine learning Engineer. Interestingly, the first interview is conducted with how to design a search system. As an algorithm developer or data scientist, I spend most of my time on applying/customizing algorithms and learning the domain knowledges. Developing simple systems is still ok to me. But a search system? Never done it before. Absolutely unexpected and of course, it went south.

The failure motivates me to implement a simple search system from scratch. The following are the summary of what I learned and implemented.

1. Project Introduction

A simple search system that mostly written in python. Usually, a search/recommend system is composed by several key components (1) Web Crawler (2) Indexing (3) Recall model/mechanism (4) Rank model (5) Query Web-Interface.

1.1 Web Crawler

  • Using selenium and beautifulsoup4 to fetch the data from https://stackoverflow.com that mostly related to database. In addition, currently only crawling the title as document for storage concerns. This reop use sqlite3 as forward-database for simplicity.

1.2 Indexing

  • Using nltk and gensim to do the text normalization, stemming, lemmatization and doc2vec.

  • Using a customized object as inverted (index) database. For real applications, we can chose leveldb, dynamodb or rockdb. The customized object is basically a k-value object with the optimization that applies tire-tree on inverted index.

    index (key word) Value
    sql [ (doc_id, term_frequency), …]
    database [ (doc_id, term_frequency), …]
  • There are many advanced efficient index way, and for vector-database the indexing method often consisted of clustering and quantization methods.

1.3 Recall Model / Mechanism

  • Here using traditional bool-query. It could be further enhance with term-frequency or even a model to recall the candidates.

1.4 Ranking Model / Mechanism

  • Here using Doc2Vec from gensim. The pretrained weight is from https://github.com/jhlau/doc2vec.

  • The idea is to recommend the item base on vector similarity after recall-model.

1.5 Query Interface

  • Using Flask and Jinja template with bootstrap UI framework.

2. Usage Flow

  • install the requirement packages
  • install chrome driver for selenium in data/web_driver
    • It should align with the config.py
  • python run_web_crawler.py
    • It would create a forward database and storing blob in local machine.
  • python run_model_indexer.py
    • It would run text normalization and then build the inverted-index object.
  • python run_query_interface.py
    • It would run a flask server on localhost:5000.
  • Config File
    • forward_database = r‘data/sqldb/example.db’
    • inverted_database = r‘data/sqldb/inverted_key_word_db.pkl’
    • Doc2Vector300D_model_weight = r“data/model/doc2vec.bin”

3. Result Comparison

Current result just using traditional bool query, and not counting lots features like voting, user persona, content of question, and content of reply answers.

This Repo Original Website

4. Further Improvement

  • There are lots of things we could improve on and play with:
    • Model-based recall mechanism: Personally, I also build the vectors database of title-text, however not index it yet for fast-read. There are some ways to index the dense-vector database via some clustering or quantization methods. I would like to try some of them after more understanding of them. Also, I am not sure how much improvement would it bring to the overall performance.
    • Stronger Ranking model: There are many rooms to improve the ranking model, like taking extra-features into the model or using a better model.
    • Distributed Storage: we definitely can shading the database, both on forward-database and inverted database.
    • Crawling for more content: Due to storage limitations, I treat the title of the document as entire document. However, we could also crawling and parsing content as well.
    • Advanced Indexing method: if the data is large, we may need to do some compression on index. In addition, for dense vector-based search/recommend system, we may need to using clustering/product quantization plus multi-index method to index the database. There are several famous method like Non-Orthogonal Inverted Multi-Index or IVFADC.