Run ./bin/pyspark, enter the code below, and check the output. On startup you should see a line like:
INFO SparkUI: Started SparkUI at http://192.168.0.154:4040
Programming Assignment: Simple Join in Spark
Deadline: Pass this assignment by July 3, 11:59 PM PDT.
First, make sure you were able to complete the "Setup PySpark on the Cloudera VM" tutorial in lesson 1 of this module.
In this programming assignment we will implement in Spark the same join of two wordcount datasets that you performed in the module 4 lesson 2 programming assignment.
Load datasets
First of all, open the pyspark shell and load from HDFS the datasets you created for the previous assignment:
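Starting with the first dataset; a minimal sketch, assuming it was saved to HDFS as input/join1_FileA.txt (adjust the path to wherever you stored it):

>>> fileA = sc.textFile("input/join1_FileA.txt")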
Let's make sure the file content is correct:
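>>> fileA.collect()   # collect() brings the whole (small) file back to the driver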
should return (with the sample data from the module 4 assignment):
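[u'able,991', u'about,11', u'burger,15', u'actor,22']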
Then load the second dataset:
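>>> fileB = sc.textFile("input/join1_FileB.txt")   # same path assumption as for fileA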
same verification:
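>>> fileB.collect()

Each element should be a u'date word,count' string, e.g. u'Jan-01 able,5' in the sample data.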
Mapper for fileA
First you need to create a map function for fileA that takes a line, splits it on the comma, and converts the count to an integer.
You need to copy-paste the following function into the pyspark console, then edit the 2 <ENTER_CODE_HERE> lines to perform the necessary operations:
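A reconstruction of the template (the comments are paraphrased; the completed function is reproduced at the end of this page):

def split_fileA(line):
    # split the input line into word and count on the comma
    <ENTER_CODE_HERE>
    # turn the count into an integer
    <ENTER_CODE_HERE>
    return (word, count)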
You can test your function by defining a test variable:
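>>> test_line = "able,991"   # any "word,count" record from fileA works here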
and make sure that:
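>>> split_fileA(test_line)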
returns:
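('able', 991)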
Now we can proceed to run the map transformation on the fileA RDD:
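>>> fileA_data = fileA.map(split_fileA)   # fileA_data is just a chosen name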
If the mapper is implemented correctly, collecting the mapped RDD should give this result (with the sample data):
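>>> fileA_data.collect()
[(u'able', 991), (u'about', 11), (u'burger', 15), (u'actor', 22)]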
Make sure that the key of each pair is a string (i.e. it is delimited by quotes ' ') and the value is an integer.
Mapper for fileB
The mapper for fileB is more complex because we need to extract the word from a composite 'date word' key: the line is first split on the comma to separate the count, then the 'date word' part is split on the space to separate the date from the word (see the completed split_fileB function at the end of this page).
Running:
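>>> fileB_data = fileB.map(split_fileB)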
and then gathering the output back to the pyspark Driver console:
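>>> fileB_data.collect()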
should give a list of (word, 'date count') pairs, one per input line; with the sample data:
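[(u'able', u'Jan-01 5'), (u'about', u'Feb-02 3'), ...]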
Run join
The goal is to join the two datasets using the words as keys, and to print for each word the wordcount for a specific date together with the total count from fileA.
Basically, for each word in fileB, we would like to print the date and count from fileB but also the total count from fileA.
Spark implements the join transformation: given an RDD of (K, V) pairs and another RDD of (K, W) pairs, join returns a dataset that contains (K, (V, W)) pairs.
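A sketch, assuming the RDD names used above (joining fileB_data with fileA_data puts the 'date count' string first in each value pair):

>>> fileB_joined_fileA = fileB_data.join(fileA_data)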
Verify the result
You can inspect the full result with:
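>>> fileB_joined_fileA.collect()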
You should make sure that this result agrees with what you were expecting.
Submit one line for grading
Finally, you need to create a text file with just one line for submission.
From the Cloudera VM, open the text editor from Applications > Accessories > gedit Text Editor.
Paste a single line of output from your pyspark console: the line related to the word 'actor'.
Do NOT copy the comma at the end; the line should start with an open parenthesis ( and end with a closed parenthesis ).
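For illustration only (placeholder values, not the graded answer), the line has the shape:

(u'actor', (u'<date> <count>', <total count from fileA>))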
In gedit, click on the Save button and save it in the default folder (/home/cloudera) with the name "spark_join1.txt".
Now open the browser within the Cloudera VM, log in to Coursera, and upload that file for grading.
For reference, here are the completed mapper functions as entered in the pyspark console:

>>> def split_fileA(line):
...     # split the input line into word and count on the comma
...     key_value = line.split(",")
...     word = key_value[0]
...     # turn the count into an integer
...     count = int(key_value[1])
...     return (word, count)
>>> def split_fileB(line):
...     # separate the "date word" part from the count on the comma
...     key_value = line.split(",")
...     date_word = key_value[0]
...     count_string = key_value[1]
...     # split the composite key on the space into date and word
...     key_value2 = date_word.split(" ")
...     word = key_value2[1]
...     date = key_value2[0]
...     # the word becomes the key; date and count become the value
...     return (word, date + " " + count_string)