Distributed Search System for IMDB Dataset

Overview

This project is a distributed system for fast, reliable search over a large IMDB movie dataset. It combines a Kafka-based data pipeline, a sharded and replicated Elasticsearch cluster, and load-balanced Django API nodes behind a React frontend, with the goals of scalability, availability, and efficiency.

Features

Distributed Data Pipeline

Kafka Messaging Queues: Use partitions and replication for parallel, fault-tolerant data streaming.
Python Processing Scripts: Use a master-slave pattern with multiprocessing, in which a master process splits large JSON datasets across worker processes.
Logstash Integration: Directs processed data into specific Elasticsearch indexes for storage and search.
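
The master-slave processing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the record fields (id, title, year) and the function names clean_record and process_lines are assumptions made for the example.

```python
import json
from multiprocessing import Pool


def clean_record(line: str) -> dict:
    """Worker task: parse one JSON line and normalize a few fields.

    The field names here are hypothetical; the real dataset schema may differ.
    """
    raw = json.loads(line)
    return {
        "id": raw.get("id"),
        "title": (raw.get("title") or "").strip(),
        "year": int(raw["year"]) if raw.get("year") else None,
    }


def process_lines(lines, workers: int = 4, chunksize: int = 1000):
    """Master: hand chunks of lines to worker processes, collect results in order."""
    with Pool(processes=workers) as pool:
        return pool.map(clean_record, lines, chunksize=chunksize)


if __name__ == "__main__":
    # Illustrative input only, not real dataset rows.
    sample = [
        '{"id": "m1", "title": " Example Movie ", "year": "1999"}',
        '{"id": "m2", "title": "Another Movie", "year": null}',
    ]
    for record in process_lines(sample, workers=2, chunksize=1):
        print(record)
```

A larger chunksize reduces inter-process overhead when individual records are cheap to process, which is usually the case for line-oriented JSON.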

Scalable Search Architecture

Elasticsearch Nodes: Use sharding to spread each index across nodes for faster searches, and replication to keep data available if a node fails.
Distributed Search Execution: Django-based API supports distributed query handling.
React Frontend: Provides a user-friendly interface for querying the IMDB dataset.
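
As an illustration of how sharding and replication are typically declared, the sketch below builds an index-creation body using the standard Elasticsearch settings number_of_shards and number_of_replicas. The shard counts and the mapped fields are assumptions for the example, not values taken from this repository.

```python
import json


def index_body(primary_shards: int, replicas: int) -> dict:
    """Build the JSON body for creating an index with explicit shard settings."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least 1 primary shard and 0 or more replicas")
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        },
        # Hypothetical mapping; the real index fields may differ.
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "year": {"type": "integer"},
            }
        },
    }


def total_shard_copies(body: dict) -> int:
    """Each primary shard has `replicas` extra copies spread over other nodes."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


if __name__ == "__main__":
    body = index_body(primary_shards=3, replicas=1)
    print(json.dumps(body, indent=2))
    print("shard copies in the cluster:", total_shard_copies(body))
```

The number of primary shards is fixed when the index is created, while the replica count can be changed later, so the primary count is the setting worth planning up front.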

Load Balancing

Django API: Search requests are distributed across multiple API nodes so that no single node becomes a bottleneck.
Elasticsearch Clusters: Configured for scalability and fault tolerance.
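
The README does not state which balancing strategy is used; in a Kubernetes deployment this is usually handled by a Service rather than by application code. Purely to illustrate the idea of spreading requests over several API nodes, here is a small round-robin sketch. The class name and node names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out nodes in rotation, skipping any that are marked down."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._healthy = set(self._nodes)
        self._ring = cycle(self._nodes)

    def mark_down(self, node):
        self._healthy.discard(node)

    def mark_up(self, node):
        if node in self._nodes:
            self._healthy.add(node)

    def next_node(self):
        if not self._healthy:
            raise RuntimeError("no healthy nodes available")
        # At most one full turn of the ring is needed to find a healthy node.
        for _ in range(len(self._nodes)):
            node = next(self._ring)
            if node in self._healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["api-0", "api-1", "api-2"])
    print([balancer.next_node() for _ in range(4)])
    balancer.mark_down("api-1")
    print([balancer.next_node() for _ in range(3)])
```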

AWS Deployment

Kubernetes Deployment: YAML files configure and deploy Kafka, Zookeeper, Elasticsearch, Logstash, and application components on AWS.
Persistent Volumes: Configured using AWS EFS for durable storage of logs and indexes.

System Architecture

Data Pipeline

Producer: Python scripts populate Kafka topics with IMDB data.
Kafka and Zookeeper: Manage messaging queues with partitioning and replication.
Logstash: Consumes Kafka topics and directs data into Elasticsearch indexes.
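
The producer stage might look something like the sketch below, which turns each record into a keyed Kafka message. The topic name imdb-movies, the id field, and the helper names are assumptions for illustration. The send function is injected so that the example runs without a broker.

```python
import json
from typing import Callable, Iterable

TOPIC = "imdb-movies"  # hypothetical topic name


def to_message(record: dict) -> tuple[bytes, bytes]:
    """Serialize a record into a (key, value) pair of bytes.

    Keying by record id means Kafka's default partitioner sends every
    message with the same id to the same partition.
    """
    key = str(record["id"]).encode("utf-8")
    value = json.dumps(record, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return key, value


def publish(records: Iterable[dict], send: Callable[[str, bytes, bytes], object]) -> int:
    """Send every record through `send(topic, key, value)`; return the count sent."""
    count = 0
    for record in records:
        key, value = to_message(record)
        send(TOPIC, key, value)
        count += 1
    return count


if __name__ == "__main__":
    sent = []
    n = publish(
        [{"id": "m1", "title": "Example Movie"}],  # illustrative record only
        lambda topic, key, value: sent.append((topic, key, value)),
    )
    print(n, sent)
```

With the kafka-python client, the injected function could be a thin wrapper such as `lambda t, k, v: producer.send(t, key=k, value=v)`, where producer is a KafkaProducer. Whether this project uses that client is not stated in the README.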

Search

Elasticsearch Nodes: Configured with multiple shards and replicas for distributed search.
Django API: Acts as the backend interface for query execution.
React Frontend: Allows users to input queries and view results.
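
To make the query path concrete, the following sketch shows how the API layer could translate a user's search text into an Elasticsearch request body. It uses the standard multi_match query with from and size for paging. The field names title and plot, the boost on title, and the page-size limit are assumptions, not details from this repository.

```python
def build_search_body(text: str, page: int = 1, page_size: int = 10) -> dict:
    """Translate user input into an Elasticsearch search request body."""
    text = text.strip()
    if not text:
        raise ValueError("query text must not be empty")
    if page < 1 or not 1 <= page_size <= 100:
        raise ValueError("page must be >= 1 and page_size between 1 and 100")
    return {
        "query": {
            "multi_match": {
                "query": text,
                # Hypothetical fields; "^2" boosts matches in the title.
                "fields": ["title^2", "plot"],
            }
        },
        "from": (page - 1) * page_size,  # offset of the first hit to return
        "size": page_size,
    }


if __name__ == "__main__":
    print(build_search_body("  space adventure ", page=2, page_size=20))
```

A Django view could validate the request parameters, call a helper like this, and forward the resulting dictionary to the index's _search endpoint. Capping page_size keeps a single request from pulling an unbounded number of hits.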

Deployment

Kubernetes: YAML files manage deployment, scaling, and high availability.
Persistent Volume Storage: AWS EFS ensures durability of logs and indexes.

Repository: venkatsaijanumpally/Distributed-System-Design