
# Hands-On PySpark for Big Data Analysis : Computing Summary Statistics with MLlib | packtpub.com


This video tutorial has been taken from Hands-On PySpark for Big Data Analysis. You can learn more and buy the full video course here https://bit.ly/2CeUDWO



### Transcript

00:05 Alright guys, welcome back to PySpark for Beginners. We're on to Chapter 5, talking about powerful exploratory data analysis with MLlib. In this section we're going to take a look at three different topics: number one, how can we compute summary statistics with MLlib; number two, using Pearson and Spearman correlations to discover different correlations in our data sets; and finally, testing our hypotheses on large data sets.

00:38 The first thing we're going to talk about is computing summary statistics with MLlib. In this video we're going to take a look at, number one, what are summary statistics, and number two, how do we use MLlib to create summary statistics. Let's jump in and take a look.

01:04 As you all know, MLlib is the machine learning library that comes with Spark. A relatively new development allows us to use Spark's data processing capabilities and pipe them into machine learning capabilities native to Spark. What that means is that we can use Spark not only to ingest, collect, and transform data; we can also analyze it and use it to build machine learning models, all on this PySpark platform, which gives us a more seamless, deployable solution.

01:42 In particular, today I want to talk about summary statistics. Summary statistics is a very simple concept: you're probably familiar with something like the average, the standard deviation, or the variance of a particular variable. These are summary statistics of a data set. It's called a summary statistic because it gives you a summary via a certain statistic. For example, when we talk about the average of a data set, we're summarizing one characteristic of that data set, and that characteristic is the average.

02:24 So how do we compute summary statistics in Spark? The key actor here is the colStats function. What it does is compute the column-wise summary statistics for an input RDD. You can see that it accepts one parameter, an RDD, and it allows us to compute different summary statistics using Spark.

02:49 So going back to our Jupyter notebook, we have our Chapter 5 notebook here, and we're talking about computing summary statistics with MLlib. The first thing I want to do, like before, in In [9], is collect the data from the gzipped KDD Cup text file and pipe it into the raw_data variable. After this, because the KDD Cup data is a comma-separated value file, we first split the data by the comma character in the next line and put it in the csv variable, standing for comma-separated values. We then take the first feature of this data file, which represents the duration aspect of the data, transform it into an integer, and also wrap it in a list. You're going to see very quickly why we wrap it in a list: it helps us do summary statistics over multiple variables, not just one of them.

04:04 To activate the colStats function, we need to import the Statistics package, as seen in the first line of In [12]. This Statistics package is a sub-package of pyspark.mllib.stat. We then call the colStats function from the Statistics package and feed it some data, in this case the duration data from our data set, and we pipe the resulting summary statistics into the summary variable. To access different summary statistics like the mean, the standard deviation, and so forth, we can then call methods on this summary object. For example, we can output the mean, and because we only have one feature in our duration data set, we can index it with index 0 and get the mean of the data set. Similarly, if we import the square root function from the Python standard library, we can compute the standard deviation of the durations seen in the data set. To illustrate what happens if we don't index the summary statistics with 0, we can see that summary.max() and summary.min() give us back an array whose first element is the summary statistic that we desire.

05:38 And that's all there is to it. We've just learned, number one, what summary statistics are, and number two, how to use MLlib to create summary statistics.
