Coursera Week 5 Lesson 2 Assignment

2016. 6. 26. 10:57·Spark


Run ./bin/pyspark, enter the code below, and check the results.

INFO SparkUI: Started SparkUI at http://192.168.0.154:4040


Programming Assignment: Simple Join in Spark

Deadline: Pass this assignment by July 3, 11:59 PM PDT

First, make sure you were able to complete the "Setup PySpark on the Cloudera VM" tutorial in lesson 1 of this module.

In this programming assignment we will implement in Spark the same logic as the programming assignment in module 4 lesson 2: a join of 2 different wordcount datasets.

Load datasets

First of all open the pyspark shell and load the datasets you created for the previous assignment from HDFS:

fileA = sc.textFile("input/join1_FileA.txt")

Let's make sure the file content is correct:

fileA.collect()

should return:

Out[]: [u'able,991', u'about,11', u'burger,15', u'actor,22']

Then load the second dataset:

fileB = sc.textFile("input/join1_FileB.txt")

same verification:

fileB.collect()
Out[29]:
[u'Jan-01 able,5',
u'Feb-02 about,3',
u'Mar-03 about,8 ',
u'Apr-04 able,13',
u'Feb-22 actor,3',
u'Feb-23 burger,5',
u'Mar-08 burger,2',
u'Dec-15 able,100']

Mapper for fileA

First you need to create a map function for fileA that takes a line, splits it on the comma and turns the count into an integer.

You need to copy and paste the following function into the pyspark console, then edit the 2 <ENTER_CODE_HERE> lines to perform the necessary operations:

def split_fileA(line):
    # split the input line in word and count on the comma
    <ENTER_CODE_HERE>
    # turn the count to an integer
    <ENTER_CODE_HERE>
    return (word, count)

You can test your function by defining a test variable:

test_line = "able,991"

and make sure that:

split_fileA(test_line)

returns:

Out[]: ('able', 991)

Now we can proceed to run the map transformation on the fileA RDD:

fileA_data = fileA.map(split_fileA)

If the mapper is implemented correctly, you should get this result:

fileA_data.collect()
Out[]: [(u'able', 991), (u'about', 11), (u'burger', 15),
    (u'actor', 22)]

Make sure that the key of each pair is a string (i.e. it is delimited by quotes ' ') and the value is an integer.
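As a quick sanity check (my own addition, not part of the assignment), you can inspect the Python types of the first pair directly in the pyspark console:

pair = fileA_data.first()        # e.g. (u'able', 991)
type(pair[0]), type(pair[1])     # expect a string (unicode) key and an int value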

Mapper for fileB

The mapper for fileB is more complex, because we need to extract the word as the key while keeping both the date and the count together in the value:

def split_fileB(line):
    # split the input line into word, date and count_string
    <ENTER_CODE_HERE>
    <ENTER_CODE_HERE>
    return (word, date + " " + count_string)
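Once you have filled in the missing lines, you can test this mapper the same way as split_fileA; the sample line below is taken from the fileB data shown earlier, and the variable name test_line_B is just an illustrative choice:

test_line_B = "Jan-01 able,5"
split_fileB(test_line_B)         # should return ('able', 'Jan-01 5')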

running:

fileB_data = fileB.map(split_fileB)

and then gathering the output back to the pyspark Driver console:

fileB_data.collect()

should give the result:

Out[]:
[(u'able', u'Jan-01 5'),
(u'about', u'Feb-02 3'),
(u'about', u'Mar-03 8 '),
(u'able', u'Apr-04 13'),
(u'actor', u'Feb-22 3'),
(u'burger', u'Feb-23 5'),
(u'burger', u'Mar-08 2'),
(u'able', u'Dec-15 100')]

Run join

The goal is to join the two datasets using the words as keys, and to print for each word the wordcount for a specific date together with the total count from fileA.

Basically for each word in fileB, we would like to print the date and count from fileB but also the total count from fileA.

Spark implements the join transformation: given an RDD of (K, V) pairs to be joined with another RDD of (K, W) pairs, it returns an RDD that contains (K, (V, W)) pairs.

fileB_joined_fileA = fileB_data.join(fileA_data)
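If the shape of the joined pairs is unclear, here is a minimal sketch on made-up toy data (the keys and values below are placeholders, not the assignment data):

left = sc.parallelize([("apple", "Jan-05 2"), ("pear", "Feb-10 4")])
right = sc.parallelize([("apple", 10), ("pear", 7)])
left.join(right).collect()
# [('apple', ('Jan-05 2', 10)), ('pear', ('Feb-10 4', 7))]  (order may vary)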

Verify the result

You can inspect the full result with:

fileB_joined_fileA.collect()

You should make sure that this result agrees with what you were expecting.

Submit one line for grading

Finally, you need to create a text file with just one line for submission.

From the Cloudera VM, open the text editor from Applications > Accessories > gedit Text Editor.

Paste a single line from your pyspark console, the one related to the word actor:

(u'actor', ???????????)

Do NOT copy the comma at the end; the line should start with an open parenthesis ( and end with a closing parenthesis ).
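If scrolling through the whole collect() output is inconvenient, a filter like the one below (my own addition, not required by the assignment) returns only the actor record:

fileB_joined_fileA.filter(lambda pair: pair[0] == 'actor').collect()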

In gedit, click on the Save button and save it in the default folder (/home/cloudera) with the name spark_join1.txt

Now open the browser within the Cloudera VM, log in to Coursera, and upload that file for grading.



My solution, entered in the pyspark console:

def split_fileA(line):
    # split the line into word and count on the comma
    key_value = line.split(",")
    word = key_value[0]
    # turn the count string into an integer
    count = int(key_value[1])
    return (word, count)

def split_fileB(line):
    # split on the comma into "date word" and count_string
    key_value = line.split(",")
    date_word = key_value[0]
    count_string = key_value[1]
    # split "date word" on the space to separate date and word
    key_value2 = date_word.split(" ")
    word = key_value2[1]
    date = key_value2[0]
    return (word, date + " " + count_string)


