This API returns the top 10 most similar results for a user query. It uses 10% of Stack Overflow's data. You can find it here.
P.S. For notebook.ipynb, you can run the entire notebook directly and a Flask API will be deployed. Use that for testing purposes. All the instructions are in the notebook.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- docker
- python
- tensorflow
- flask / fastapi
- Clone the repo
https://github.com/budukhyash/semantic-search-engine
- Running this will start an Open Distro for Elasticsearch instance. Read more about it here.
docker run -p 8200:9200 -p 8600:9600 -e "discovery.type=single-node" amazon/opendistro-for-elasticsearch:1.8.0
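Before moving on, it can help to confirm that the container is actually answering on the mapped port. The following is a minimal sketch using only the standard library. It assumes the image's demo security configuration (HTTPS with a self-signed certificate and the admin/admin account), which this README does not state, so adjust the scheme and credentials if your setup differs. The helper names build_health_request and cluster_status are hypothetical.

```python
import base64
import json
import ssl
import urllib.request


def build_health_request(host="localhost", port=8200, user="admin", password="admin"):
    """Build a request for the _cluster/health endpoint.

    Port 8200 is the host port mapped by the docker command above.
    The admin/admin credentials are an assumption (demo defaults).
    """
    url = f"https://{host}:{port}/_cluster/health"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def cluster_status(request, timeout=5):
    """Send the request and return the cluster status: green, yellow or red."""
    # Assumption: the container uses a self-signed certificate, so
    # verification is disabled. Only do this against a local test container.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
        return json.load(resp)["status"]
```

Calling cluster_status(build_health_request()) once the container has finished starting should return a status string. A single-node cluster commonly reports yellow after indices with replicas are created, which is fine for local testing.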
- Download the dataset and extract it, then download the USE4 Universal Sentence Encoder by Google. Make sure the downloaded files are in the repository directory.
- For data ingestion, run python elastic_search_ingestion.py X, where X is the number of documents to index. For example:
python elastic_search_ingestion.py 20000
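To give a rough idea of what an ingestion step like this sends to Elasticsearch, here is a sketch that serializes documents and their embeddings into the newline-delimited JSON body expected by the _bulk endpoint. It is not the contents of elastic_search_ingestion.py. The index name posts and the field names title and title_vector are hypothetical, and the vectors are assumed to come from the sentence encoder downloaded earlier.

```python
import json


def build_bulk_body(docs, vectors, index_name="posts"):
    """Serialize documents and their embeddings as a _bulk NDJSON payload.

    index_name and the field names below are illustrative assumptions.
    """
    if len(docs) != len(vectors):
        raise ValueError("docs and vectors must have the same length")
    lines = []
    for doc_id, (doc, vector) in enumerate(zip(docs, vectors)):
        # Action line: index this document under an explicit id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        # Source line: the text plus its embedding as a plain list of floats.
        lines.append(json.dumps({"title": doc, "title_vector": list(vector)}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Sending documents in batches like this keeps the number of HTTP round trips low, which matters once X reaches tens of thousands of documents.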
- After the ingestion is completed, you can start the server by running
uvicorn server:app --reload --port 9999
- Postman Docs
- After starting the server, the docs can be found here.
- http://localhost:9999/docs#/
- You should see something like this.
- /semantic returns the top 10 most similar results. It considers the semantic meaning of the query and uses cosine similarity to rank the documents.
- /keywords returns the most similar results using the traditional keyword approach based on an inverted index. Elasticsearch ranks these documents with a TF-IDF-style scoring scheme (BM25 by default in recent versions).
- Response time (with 1 lakh, i.e. 100,000, documents ingested)
- under 300 ms for semantic search
- under 150 ms for keyword-based search
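The cosine-similarity ranking that /semantic is described as using can be illustrated with a small brute-force sketch. This is only an illustration of the idea: in the project the scoring presumably happens inside Elasticsearch, and the function names and the dictionary of document vectors here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=10):
    """Return the k (doc_id, score) pairs with the highest cosine similarity.

    doc_vectors maps a document id to its embedding.
    """
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in doc_vectors.items()
    ]
    # Highest similarity first; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]
```

Because cosine similarity depends only on direction, a short query and a long post can still score highly if their embeddings point the same way, which is what lets the semantic endpoint match on meaning where a keyword search needs overlapping terms.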