Lab Assignment 3: Finding Facebook mutual friends in Spark, and comparing RDD vs DataFrame in PySpark
Source Code : Click Here
Video/Demo : Youtube-click here
This lab assignment has two parts. The first part finds Facebook mutual friends using Spark. The second part compares RDD and DataFrame in PySpark, based on the FIFA dataset.
1. MapReduce algorithm for finding Facebook common friends on Apache Spark
First, take the input as discussed in the use case: "A -> B C D, B -> A C D E, C -> A B D E, D -> A B C E, E -> B C D". In the map phase, emit each pair of friends as a key, together with a friend list. Then group the records by pair key, and finally reduce each group to get the mutual friends list for that pair.
Create a mapper class as shown in the code snippet below. Each line of the input file is split based on "". The split is then checked to have length 2, where the first part is the source (base) user and the rest is the list of that user's friends. The keys are then prepared as (A,B) or (B,A), ordered by the integer values of A and B in the input, so that both users of a pair emit the same key.
Create a reducer class in which the data is grouped by key, such as (A,B) or (B,C), along with the friend lists produced for that key. The lists are then reduced to find the mutual friends of each pair, e.g. (A,B).
A main method acts as the driver: it sets the mapper and reducer classes, takes the input, and produces the output.
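The map, group, and reduce steps described above can be sketched in plain Python. This is a minimal illustration, not the lab's actual mapper and reducer classes: the function names (`parse_line`, `map_phase`, `reduce_phase`, `mutual_friends`) are hypothetical, the `->` delimiter is assumed from the use-case input, and keys are ordered by string comparison rather than by integer value.

```python
from collections import defaultdict


def parse_line(line):
    """Split 'A -> B C D' into the base user and that user's friend list."""
    user, _, friends = line.partition("->")
    return user.strip(), friends.split()


def map_phase(user, friends):
    """Emit (sorted pair key, full friend set) once per friend of the user."""
    for friend in friends:
        # Sorting the pair makes (A,B) and (B,A) collapse to the same key.
        key = tuple(sorted((user, friend)))
        yield key, set(friends)


def reduce_phase(friend_sets):
    """Intersect the friend sets collected under one pair key."""
    return sorted(set.intersection(*friend_sets))


def mutual_friends(lines):
    grouped = defaultdict(list)  # stands in for the shuffle/group step
    for line in lines:
        user, friends = parse_line(line)
        for key, friend_set in map_phase(user, friends):
            grouped[key].append(friend_set)
    # A pair is complete only when both users emitted a record for it.
    return {k: reduce_phase(v) for k, v in grouped.items() if len(v) == 2}


lines = ["A -> B C D", "B -> A C D E", "C -> A B D E", "D -> A B C E", "E -> B C D"]
result = mutual_friends(lines)
for pair in sorted(result):
    print(pair, result[pair])
```

For the pair (A,B), the two emitted sets are A's friends {B, C, D} and B's friends {A, C, D, E}, so their intersection is the mutual friends list [C, D].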
The input file for this can be found at https://github.com/Ruthvicp/CS5590_BigDataProgramming/blob/master/Lab/Lab3/Source/facebook_combined.txt
MapReduce is efficient at finding the common/mutual friends of two users: their friend lists are grouped together under one key and then filtered by the reduce operation.
The snapshot for the output is given below
Representing the mutual friends problem using the map reduce diagram
2. Consider the FIFA dataset, perform certain operations and queries on both RDD and DataFrame, and compare the two
a) Create a Spark data frame and change the StructType of the columns
b) Perform certain operations to find insights on the data frame
c) Comparing RDD vs Data Frame using queries
We load the FIFA WorldCupMatches.csv and find details regarding attendance, summary, semi-final games, the 2014 World Cup, etc.
We perform similar queries on both RDD and DataFrame to compare query and syntax complexity. We also write SQL queries on the Spark DataFrame to get the same results.
- Creating data frame
- Changing StructType
- Performing filter, groupby, crosstab, dropna, etc.
#### 1. Find the host country and the top 10 highest no. of total goals scored
The data set can be downloaded from https://www.kaggle.com/abecklas/fifa-world-cup#WorldCupMatches.csv. I have used two files: WorldCupMatches.csv and WorldCups.csv.
#### The output for the insights on data frame is given below
RDDs in Apache Spark offer low-level functionality and control, whereas Datasets (DataFrames in PySpark) offer higher-level functionality.