
Lab Assignment 3: Finding Facebook Mutual Friends in Spark, and Comparing RDD vs DataFrame in PySpark

Ruthvicp edited this page Jul 16, 2018 · 11 revisions

Team Id : 14

Member 1 : Ruthvic Punyamurtula

Class Id : 16

Member 2 : Shankar Pentyala

Class Id : 15


Source Code : Click Here

Video/Demo : Youtube-click here


Introduction

The first part of this lab assignment finds Facebook mutual friends using Spark. The second part compares RDD and DataFrame in PySpark using a FIFA dataset.

Objective

1. MapReduce algorithm for finding Facebook common friends on Apache Spark

Approach

First, take the input as discussed in the use case: "A -> B C D, B -> A C D E, C -> A B D E, D -> A B C E, E -> B C D". In the map phase, emit a key for each pair of people together with the friend list of one of them. Group the records by the pair key, and finally reduce each group to obtain the mutual-friends list.

Workflow

Create a mapper class as shown in the code snippet below. Each line of the input file is split on "->", and the split has length 2: the first part is the source (base) user and the second part is that user's list of friends. The keys are then prepared as (A,B) or (B,A) based on the integer values of A and B in the input, so both orderings map to the same key.
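A minimal sketch of the mapper logic described above, assuming each input line has the form "A -> B C D" (user, then a space-separated friend list, as in the use case). In Spark this function would be passed to `rdd.flatMap()`; it is written as plain Python here so the logic can be read and tested on its own.

```python
def mutual_friend_pairs(line):
    # Split the line on "->": left side is the user, right side the friend list.
    user, _, friends_part = line.partition("->")
    user = user.strip()
    friends = friends_part.split()
    pairs = []
    for friend in friends:
        # Sort the two names so (A, B) and (B, A) become the same key.
        key = tuple(sorted((user, friend)))
        pairs.append((key, set(friends)))
    return pairs
```

For example, `mutual_friend_pairs("A -> B C D")` emits the records `((A,B), {B,C,D})`, `((A,C), {B,C,D})`, and `((A,D), {B,C,D})`, one per friend of A.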

Create a reducer class where the records are grouped by the pair key, e.g. (A,B), collecting the friend lists produced by the mapper. Each group is then reduced to the intersection of its lists, which gives the mutual friends of (A,B).
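A sketch of the reduce step described above: for each pair key, the grouped values are the two friend lists, and their intersection is the mutual-friend set. In Spark this is the kind of function that would be given to `reduceByKey()`.

```python
def intersect_friends(friends_a, friends_b):
    # The intersection can never contain either member of the pair themselves:
    # A's list contains B but not A, and B's list contains A but not B.
    return set(friends_a) & set(friends_b)
```

For the sample input, `intersect_friends({"B","C","D"}, {"A","C","D","E"})` returns `{"C", "D"}`, the mutual friends of (A,B).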

A main method acts as the driver: it configures the mapper and reducer, reads the input, and produces the output.
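The whole pipeline above can be simulated end to end on the sample input. With Spark the driver would be roughly `sc.textFile(path).flatMap(mapper).reduceByKey(set.intersection)`; the sketch below expresses the same map, group, and reduce steps with plain dictionaries so it runs without a Spark cluster.

```python
from collections import defaultdict

def mutual_friends(lines):
    # Map + group: emit (sorted pair, friend set) and collect per key.
    grouped = defaultdict(list)
    for line in lines:
        user, _, rest = line.partition("->")
        user = user.strip()
        friends = set(rest.split())
        for friend in friends:
            grouped[tuple(sorted((user, friend)))].append(friends)
    # Reduce: keep only pairs where both sides listed each other,
    # and intersect their two friend lists.
    return {pair: lists[0] & lists[1]
            for pair, lists in grouped.items() if len(lists) == 2}

sample = ["A -> B C D", "B -> A C D E", "C -> A B D E",
          "D -> A B C E", "E -> B C D"]
result = mutual_friends(sample)
print(sorted(result[("A", "B")]))   # ['C', 'D']
```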

Data set and Parameter

The input file for this can be found at https://github.com/Ruthvicp/CS5590_BigDataProgramming/blob/master/Lab/Lab3/Source/facebook_combined.txt

Evaluation

MapReduce is very efficient at finding the common/mutual friends of two users: their friend lists are grouped by the pair key and intersected in the reduce operation.

Output

The snapshot for the output is given below

Conclusion

The mutual friends problem is represented using the MapReduce diagram below.


2. RDD Vs Data Frame in PySpark

Introduction

Consider the FIFA dataset, perform operations and queries on both an RDD and a DataFrame, and compare the two.

Objectives

a) Create a Spark DataFrame and change the StructType of the columns

b) Perform operations to find insights on the DataFrame

c) Compare RDD vs DataFrame using queries

Approach

Insights on DataFrame :

We load the FIFA WorldCupMatches.csv and find details regarding attendance, match summaries, semi-final games, the 2014 World Cup, etc.

RDD vs DataFrame :

We perform similar queries on both to compare query and syntax complexity. We also write SQL queries on the Spark DataFrame to get the same results.

Workflow

Insights on Fifa DataFrame :

  • Creating the DataFrame
  • Changing the StructType of the columns
  • Performing filter, groupBy, crosstab, dropna, etc.
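The steps above can be sketched as follows. The column names ("Year", "Attendance", "Stage", and the goal columns) and the stage label "Semi-finals" are taken from the Kaggle WorldCupMatches.csv and should be adjusted to the actual header; the pyspark import is kept inside the function so the sketch can be read (and the column list reused) without Spark installed.

```python
# Columns read as strings from CSV that we cast to integers when changing
# the StructType (names assumed from the Kaggle dataset).
NUMERIC_COLS = ["Year", "Attendance", "Home Team Goals", "Away Team Goals"]

def load_matches(csv_path):
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("fifa").getOrCreate()
    # Creating the DataFrame from CSV (everything read as string).
    df = spark.read.csv(csv_path, header=True, inferSchema=False)
    # Changing the StructType: cast the numeric columns to int.
    for c in NUMERIC_COLS:
        df = df.withColumn(c, col(c).cast("int"))
    # The listed operations:
    df = df.dropna(subset=["Year"])                        # drop rows with no year
    semis = df.filter(col("Stage") == "Semi-finals")       # semi-final games (label assumed)
    by_year = df.groupBy("Year").sum("Attendance")         # total attendance per cup
    tab = df.crosstab("Year", "Stage")                     # stage counts per year
    return df, semis, by_year, tab
```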

RDD vs Data Frame :

1. Find the host country and the top 10 highest numbers of total goals scored

2. Find the years where the hosting country = the winning country

3. Find World Cup match details for years ending in zero

4. Find the statistics of World Cup 2014

5. Find the years and details in which the maximum (64) number of matches were played
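As one concrete comparison, query 2 (years where the hosting country equals the winning country) can be sketched in all three styles, assuming WorldCups.csv has columns Year, Country (host), and Winner as in the Kaggle dataset. The shared predicate is plain Python; the Spark wiring is a hedged sketch, not the assignment's exact code.

```python
def host_won(row):
    # Shared predicate, usable with rdd.filter() or on plain dicts.
    return row["Country"] == row["Winner"]

def compare(spark, csv_path):
    from pyspark.sql.functions import col
    df = spark.read.csv(csv_path, header=True)

    # RDD style: convert rows to dicts and filter with a Python function.
    rdd_years = (df.rdd.map(lambda r: r.asDict())
                   .filter(host_won)
                   .map(lambda r: r["Year"]).collect())

    # DataFrame style: a column expression that Catalyst can optimize.
    df_years = [r["Year"] for r in
                df.filter(col("Country") == col("Winner")).select("Year").collect()]

    # SQL style on the same DataFrame.
    df.createOrReplaceTempView("cups")
    sql_years = [r["Year"] for r in spark.sql(
        "SELECT Year FROM cups WHERE Country = Winner").collect()]
    return rdd_years, df_years, sql_years
```

All three return the same years; the RDD version is the most verbose, while the DataFrame and SQL versions are shorter and open to optimization.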

Datasets & parameters

The datasets can be downloaded from https://www.kaggle.com/abecklas/fifa-world-cup#WorldCupMatches.csv. I have used:

1. WorldCupMatches.csv
2. WorldCups.csv

Evaluation

The output for the insights on the DataFrame is given below.

The outputs for part 2 (RDD vs DataFrame) are given here.

1. Find the host country and the top 10 highest numbers of total goals scored

2. Find the years where the hosting country = the winning country

3. Find World Cup match details for years ending in zero

4. Find the statistics of World Cup 2014

5. Find the years and details in which the maximum (64) number of matches were played

Conclusion

RDDs in Apache Spark offer low-level functionality and fine-grained control, whereas DataFrames offer a higher-level abstraction with built-in query optimization.
