00:07 hello and welcome back to PySpark for

00:09 beginners we're on chapter 4 where we're

00:13 talking about aggregating and summarizing

00:16 data into useful reports in this section

00:20 we're going to take a look at number one

00:23 calculating averages with MapReduce and

00:27 then we're going to move on to number

00:28 two

00:28 faster average computations with

00:31 aggregate and at the very last we're

00:34 going to talk about pivot tables with key

00:37 value pair data points

00:40 let's first talk about calculating

00:42 averages with map and reduce we're going

00:46 to answer three questions in this video

00:48 how do we calculate averages what is map

00:51 what is reduce let's take a look so by

00:55 now you should be fairly familiar with

00:58 my view of how we can navigate through a

01:02 new piece of software the first thing to

01:05 do is always to check out the

01:07 documentation here I've opened up the

01:10 documentation for map map takes two

01:13 arguments one of which is optional

01:16 the first argument to map is F which is

01:19 a function that gets applied to the RDD

01:23 throughout by the function map and the

01:27 second argument or parameter if you will

01:30 is preservesPartitioning which is

01:33 defaulting to false if we look at the

01:36 documentation it says that map simply

01:39 returns a new RDD by applying a

01:42 function to each element of this RDD and

01:46 obviously this function refers to F that

01:49 we feed into the map function itself

01:53 there's a very simple example below in

01:58 the documentation where it says if we

02:01 parallelize an RDD that contains a

02:04 list of three characters B a and C and

02:08 we map a function that creates a tuple

02:13 off of each element we'll create a list

02:17 of three tuples in which the original

02:20 character is placed in the first

02:23 element of the tuple and the number or

02:26 the integer one is placed in the second
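
The documentation's example can be mirrored in plain Python; the commented line shows the PySpark form, assuming a SparkContext named `sc` as in earlier videos:

```python
# PySpark form (assumes a SparkContext `sc`):
#   sc.parallelize(["b", "a", "c"]).map(lambda x: (x, 1)).collect()
# The same transformation with Python's built-in map:
pairs = list(map(lambda x: (x, 1), ["b", "a", "c"]))
print(pairs)  # [('b', 1), ('a', 1), ('c', 1)]
```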

02:30 now let's look at reduce so reduce takes

02:35 only one argument which is F and this F

02:39 is a function where it uses this

02:42 function f to reduce a list into one

02:46 number so if we look at it from a more

02:49 technical point of view it reduces the

02:51 elements of this RDD using the specified

02:54 commutative and associative binary

02:56 operator so you don't really need to

02:58 understand this I'm going to show you

02:59 through examples what this means and if

03:03 we look at the example again we're

03:04 simply taking a list of five items and

03:08 we're going to add them together let's

03:12 dig into a real example using the kdd

03:15 data that we have been using throughout
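
Before we touch the KDD data, the documentation's five-item reduce example can be checked in plain Python; the commented line is the PySpark form, again assuming a SparkContext `sc`:

```python
from functools import reduce
from operator import add

# PySpark form (assumes a SparkContext `sc`):
#   sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
# The same reduction with functools.reduce:
total = reduce(add, [1, 2, 3, 4, 5])
print(total)  # 15
```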

03:19 so we first go into our Jupyter

03:23 notebook where we launched a

03:26 spark notebook so we used the method

03:28 specified in the last videos to launch

03:32 a Jupyter notebook instance that links to a

03:34 spark instance like before we create a

03:38 raw data variable by loading a text file

03:41 from the local disk the next thing to do

03:45 is to split this file into comma

03:48 separated values
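
A sketch of these two steps; the commented lines show the PySpark calls (the file path and the SparkContext `sc` are assumptions carried over from earlier videos), mirrored below on two made-up, shortened KDD-style lines:

```python
# PySpark form (path and `sc` are assumptions from earlier videos):
#   raw_data = sc.textFile("./kddcup.data_10_percent.gz")
#   csv_data = raw_data.map(lambda line: line.split(","))
# Plain-Python mirror on two made-up, shortened KDD-style lines:
raw_data = [
    "0,tcp,http,SF,181,5450,normal.",
    "184,tcp,telnet,SF,1511,2957,guess_passwd.",
]
csv_data = [line.split(",") for line in raw_data]
print(csv_data[0])  # first row as a list of fields
```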

03:56 and then we'd like to filter for rows

04:00 where the forty-first feature contains

04:05 the word normal
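
That filter, sketched below; in the real KDD data the label is the last field of the split row (index 41), so these made-up, shortened rows keep the label last, and we compare against "normal." with the trailing dot the KDD labels use:

```python
# PySpark form (assumes `csv_data` from the previous step; label at index 41):
#   normal_data = csv_data.filter(lambda x: x[41] == "normal.")
# Plain-Python mirror on shortened made-up rows with the label last:
csv_data = [
    ["0", "tcp", "normal."],
    ["184", "tcp", "guess_passwd."],
]
normal_data = [x for x in csv_data if x[-1] == "normal."]
print(len(normal_data))  # 1
```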

04:12 the next thing to do is to use the map

04:15 function to convert this data into an

04:18 integer
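
The duration is the first field of each row, so the map step just converts field 0 to an int; sketched below, with the PySpark form in the comment (it assumes `normal_data` from the previous step, and the variable name `durations` is my own):

```python
# PySpark form (assumes `normal_data` from the previous step):
#   durations = normal_data.map(lambda x: int(x[0]))
# Plain-Python mirror on shortened made-up rows:
normal_data = [["0", "tcp", "normal."], ["12", "udp", "normal."]]
durations = [int(x[0]) for x in normal_data]
print(durations)  # [0, 12]
```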

04:24 and then finally we can use the reduce

04:26 function to compute the total duration

04:37 and then we can print the total duration

04:45 and there we have our total duration and

04:49 so the next thing to do is to divide

04:52 this total duration with the counts of

04:54 the data
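
Putting the last steps together; the commented lines show the PySpark calls (the variable names are my own assumptions), and the plain-Python mirror below shows the arithmetic on three made-up durations:

```python
from functools import reduce

# PySpark form (assumes a `durations` RDD from the map step):
#   total_duration = durations.reduce(lambda a, b: a + b)
#   average_duration = total_duration / durations.count()
# Plain-Python mirror on made-up durations:
durations = [0, 12, 6]
total_duration = reduce(lambda a, b: a + b, durations)
average_duration = total_duration / len(durations)
print(total_duration, average_duration)  # 18 6.0
```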

05:05 and after a little computation we will

05:10 have calculated the average using map and

05:14 reduce we've just learned how we can

05:17 calculate averages with PySpark and

05:19 what the map and reduce functions are in

05:23 PySpark
