# Python coding tutorial 3 – Data Manipulation w. Numpy and Pandas

Alex Smith, Broad Institute, US, presents: Python Programming Basics Tutorial 3 – Data Manipulation with NumPy and Pandas

### Transcript

00:00 okay hello everyone my name is Alex

00:03 Smith I work at the Broad Institute of

00:06 Harvard and MIT as a computational

00:08 biologist and today I'll be showing you

00:11 how to do some basic manipulation with

00:13 two Python libraries called numpy and

00:17 pandas they're both very useful for

00:20 doing science and I'll show you how a

00:22 quick note on this presentation

00:25 previously in this online lecture series

00:27 there have been some other Python

00:29 tutorials which were done in Python

00:31 version 3 for this I wrote in Python

00:33 version 2.7 there should not be many

00:37 differences I think the only difference

00:40 you'll see is how print statements can

00:42 be written in Python 3 you need to put

00:45 your print statements in brackets and

00:47 Python 2 is a little lenient with

00:49 letting you just say print something so

00:52 I generally use the latter here so

00:54 please don't be confused by that if it

00:56 comes up part-1 numpy numpy is in short

01:04 a module for doing math in python now

01:07 you may think well python already can do

01:10 math and that's right but sometimes

01:12 you'll want to do math on large groups

01:15 of numbers you'll want to do say linear

01:19 algebra on vectors and matrices as you

01:23 often will in data science and

01:24 statistics and numpy provides utilities

01:29 and objects for doing this that are much

01:33 more robust and versatile at least for

01:36 math than pythons default list object so

01:43 I'm going to start with an example this

01:46 example is done in regular Python

01:49 without numpy we're gonna take a list of

01:53 diameters the distance across circles

01:56 and we're gonna write a method that

01:59 gives the areas of those circles based

02:03 on their diameters so we start out by

02:09 importing pi from python's built-in

02:12 math module

02:13 we initialize a list of diameters 0 to 9

02:21 we initialize our areas and we write a

02:25 for loop which for each index in areas

02:30 it multiplies pi times half the diameter

02:34 of each circle we need to make those

02:38 floating-point values a floating point

02:42 is a datatype for a number it just means

02:46 a decimal value rather than an integer

02:49 value without a point after it you'll

02:51 notice that in our declaration of

02:53 diameters we have integers and in our

02:56 output we have decimals so we need to

02:58 make sure to specify those floating

02:59 points in the division and of course

03:03 it's PI R squared for that area of your

03:05 circle so we have the squared operator

03:09 the exponent operator rather in python

03:12 to the second power here and we print

03:14 them out and we successfully get the

03:17 areas we're looking for as a check

03:19 we can look at the diameter 2 we know

03:21 that divided by 2 that's just 1 so we

03:26 get pi times 1 and we get pi for this

03:28 value so we can do that but there are

03:32 some issues we might like to make a

03:35 little smoother first of all there was

03:38 the issue I just discussed with range

03:40 with the range function in Python

03:43 generating integers we had to do all

03:46 this extra bookkeeping to make sure that

03:48 it didn't round to the floor when we

03:50 divided by 2 and we'd probably like to

03:53 avoid that unfortunately there's no way

03:55 to force range within range to output

04:00 floating-point numbers also we had to

04:05 write this for loop just to do math that

04:08 could be done on one line if you were

04:10 just writing math it's a property in

04:12 math that if you multiply a scalar value

04:15 like pi by a vector an arrangement of

04:18 numbers similar to the list we

04:20 initialized with diameters it's a

04:24 property in math that if you multiply a

04:25 scalar by a vector

04:27 you get a vector with each element of

04:29 the vector multiplied by the scalar

04:31 value so why do we have to bother with

04:33 this for loop let's try doing it the way

04:36 we do in math let's just try writing

04:38 that straight into python and see what

04:40 happens so here we say okay this list

04:44 here we'll just treat the list as a

04:46 vector and we'll multiply our scalar

04:49 value by the whole list square it see

04:53 what happens and perhaps unsurprisingly

04:56 we get an error so yeah in math the

05:01 product of a scalar and a vector is a

05:03 vector but we're not multiplying a

05:05 scalar and a vector we're multiplying a

05:07 floating-point number and a Python list

05:11 object and the product the product of

05:15 those two things is an error so if only

05:19 there were some data object that could

05:21 more closely represent a mathematical

05:23 vector
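
The plain-Python approach described above can be sketched roughly as follows. This is a minimal reconstruction written for Python 3 (the talk used 2.7), not the speaker's exact code: the variable names and the exact expression that raises the error are assumptions.

```python
from math import pi

diameters = list(range(10))  # integers 0 through 9

# Loop version: area = pi * r**2, with r = diameter / 2.
# Dividing by 2.0 forces floating-point division (needed in Python 2).
areas = []
for d in diameters:
    areas.append(pi * (d / 2.0) ** 2)

print(areas[2])  # diameter 2 -> radius 1 -> area equals pi

# Treating the list as if it were a mathematical vector fails,
# because a Python list does not support arithmetic with a float.
try:
    vector_style = pi * (diameters / 2.0) ** 2
except TypeError as err:
    print("TypeError:", err)
```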

05:24 well there is and numpy has it here if

05:32 you're not familiar with an import

05:34 statement this is import numpy a module

05:36 I've installed in python as np and np is

05:40 an alias for numpy so that when I

05:43 want to call numpy functions further on

05:45 I don't have to type numpy every time

05:46 I can just type np here we declare the

05:51 diameters and we use our first numpy

05:55 method arange arange is just like

05:57 python's range except you can give it a

06:04 datatype so I specified that I want my

06:08 range from 0 to 9 to be floating-point

06:10 numbers and accordingly in the printed

06:13 statement here we see we get

06:15 floating-point numbers from that which

06:17 saves me having to do the bookkeeping

06:18 here and making 2 a floating point

06:21 so that it knows not to round and for

06:27 areas I do just what I did

06:28 just what I tried to do in the last

06:31 slide I multiply pi which numpy knows

06:37 straight by the array object here

06:42 divided by two and I square it and it

06:47 knows what to do with that we get the

06:49 same results we got when we used our for

06:52 loop but with this much simpler

06:54 statement that you'd write as you would

06:57 write what we initialized when we called

07:04 numpy's arange function was not a

07:07 python list object but an object

07:09 specific to numpy called an ndarray

07:11 like n-dimensional array and that is

07:15 the nice flexible powerful object we are

07:17 now working with and we can do math with

07:25 as the slide says it saves us some

07:27 bookkeeping and it has a lot of

07:28 different methods and there are several

07:31 different what we would call

07:32 constructors for the ndarray in numpy a

07:35 constructor just means it makes an

07:37 object so let's look at a few commenting

07:43 these lines so here's what we just did
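
As a rough sketch of the NumPy version of the circle example, together with the `arange` call forms walked through in the next few lines: the variable names are assumptions, `np.pi` is assumed to be the constant meant by "pi which numpy knows", and the `linspace` call is only one possible way to use that function.

```python
import numpy as np

# Circle areas without a loop: arithmetic applies to the whole array.
diameters = np.arange(10, dtype=float)   # 0.0, 1.0, ..., 9.0
areas = np.pi * (diameters / 2) ** 2
print(areas)

# Different ways of calling arange:
a = np.arange(10)                # integers 0..9, like range(10)
b = np.arange(10, dtype=float)   # same values as floats
c = np.arange(0, 10, 1.0)        # start, stop, float step -> floats
d = np.arange(0, 1, 0.1)         # decimal steps from 0.0 up to 0.9

# linspace takes the number of points instead of a step size.
e = np.linspace(0, 0.9, 10)
```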

07:47 numpy.arange if you just give it

07:49 an integer argument it'll count up to

07:51 that integer so there it's identical to

07:55 python's range method if we give it that

08:01 integer and then also give it data type

08:03 float it'll give you floating point

08:05 values like in the last example here I

08:12 give it a start a stop and also a step

08:16 the step tells it how much to increment

08:19 in its count and since I gave the step

08:21 as a floating point value it gave an

08:23 identical result to this line and also

08:28 something you can't

08:36 do with range is to use decimal values under

08:40 one so I give it a start value of zero

08:42 stop value of one an increment of point

08:46 one which gives us

08:49 array values ranging from zero to

08:52 point nine there's also a similar

08:56 constructor to arange called

08:58 linspace I encourage you to give it a

09:00 try if you start coding some numpy on

09:03 your own it's very handy so why was it

09:07 that our circle example worked it's

09:10 because numpy has a built in operation

09:12 for when you multiply things by an nd

09:15 array it's called broadcasting so we saw

09:18 that for the circle example broadcasting

09:21 behaved just like the multiplication of

09:23 a scalar by a vector in math so you

09:27 might be curious well how does this work

09:29 once we're multiplying ndarrays by

09:31 other ndarrays so we have a few

09:34 examples for you here here I've

09:38 constructed a few ndarrays similarly to

09:41 how you saw in the last slide we have X

09:43 1 2 3 y 4 5 6 so let's multiply them by

09:49 each other and see what happens

09:50 we get 4 10 and 18 so we see this was

09:54 done element by element 1 times 4 gives

09:56 4 2 times 5 gives 10 and 3 times 6 gives

10:00 18
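
A small sketch of the element-by-element arithmetic on these two arrays, covering the product just described plus the sum, quotient and difference. It is written for Python 3, where `/` on integer arrays gives true division; the constructor call `np.array` is an assumption, since the transcript does not say how `x` and `y` were built.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(x * y)  # [ 4 10 18]
print(x + y)  # [5 7 9]
print(x / y)  # element-wise quotient, starting with 0.25
print(x - y)  # [-3 -3 -3]
```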

10:02 let's do their sum can you guess what

10:05 they'll be 1 plus 4 equals 5 2 plus 5

10:07 equals 7 3 plus 6 equals 9 and let's do

10:12 the quotient and difference we also see

10:17 these are done element by element 1

10:19 divided by 4 point 2 5 1 minus 4

10:22 negative 3 and so on so for 1d arrays in

10:30 numpy it goes element by element and in

10:34 fact that's a good guess as to how

10:37 arrays of any shape will broadcast when

10:39 you perform basic operations on

10:50 critical difference between lists and

10:51 ndarrays is that ndarrays are

10:53 homogeneous which means that you can

10:55 only have objects of one data type

10:57 inside them here I initialize a Python

11:01 list and here I initialize an ndarray

11:05 in the list I put a few different

11:08 objects I put a string I put a floating

11:10 point or decimal number I put this

11:13 boolean to 0 equal month and I put an

11:16 integer 7 and we see that it has no

11:21 problem printing out the string as I put

11:23 it in a floating point number the result

11:25 of that boolean and 7 but for y I

11:29 initialize an ndarray if I print y

11:41 let me move that so it's easier to read

12:02 let me keep track of where I am here

12:11 here I'm taking a closer look at the

12:13 before and after of Y when I try to

12:15 assign a floating point number to one of

12:18 y's indices even though I declared it

12:20 to be an ndarray of integers so this is

12:25 how it was originally and with this

12:28 command I tried to set the last index to

12:30 the value 2.5

12:31 well ndarrays are homogeneous so it

12:34 stayed as an integer so when you try to

12:37 pass 2.5 as an integer it rounds to the

12:39 floor and becomes 2 you can also nest

12:45 ndarrays within ndarrays that is

12:48 how you get nd arrays and not just 1d

12:51 arrays
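
The homogeneity point made above (a list holds mixed types, while an integer ndarray keeps its integer type on assignment) can be sketched like this. The list contents and the starting values of `y` are illustrative placeholders, not the speaker's exact values.

```python
import numpy as np

# A Python list can hold objects of different types side by side.
mixed = ["a string", 1.5, 2 == 0, 7]
print(mixed)

# An ndarray built from integers has a single integer dtype.
y = np.array([1, 2, 3])
print(y.dtype)

# Assigning a float to one element does not change the dtype;
# the fractional part is dropped, so 2.5 is stored as 2.
y[-1] = 2.5
print(y)  # [1 2 2]
```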

12:58 let's start with these with these

13:01 constructors I initialize two 1d arrays

13:05 x goes from 0 to 9 and y goes from 10 to

13:09 19 as you saw with previous arange

13:11 examples then I make z a combination of

13:15 these two with the numpy vstack command

13:19 which is one of many ways to sort of

13:21 just stick arrays together in a manner

13:24 which you'll soon see is pretty

13:26 intuitive let's see what comes out of it

13:28 I printed the dimensions of Z so we now

13:33 have a two-dimensional ndarray I

13:35 printed the shape of Z 2 by 10 returned

13:39 as an object called a tuple which is

13:41 like a list but in parentheses instead

13:45 of square brackets and finally here's

13:49 what Z looks like it's printed kind of

13:52 like a list of two lists or rather an

13:56 ndarray of 2 ndarrays where the

13:58 first ndarray is x as it was

14:00 declared up here and the second row or

14:04 ndarray is y

14:05 also as declared I should add we can now

14:23 give z pairs of coordinates and look at

14:26 different values so if for example we

14:29 want to look at row 1 column 5

14:34 well we'd give it index 0 for row 1

14:37 because Python is 0 indexed and for

14:40 column 5 again because of the zero

14:42 indexing we'd give it column 4 so we

14:45 expect to see the number 4 here and

14:51 indeed we see 4
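
A short sketch of the stacking and indexing steps just described, assuming the stacking command is `np.vstack` and that the dimension and shape were read from the `ndim` and `shape` attributes:

```python
import numpy as np

x = np.arange(10)        # 0..9
y = np.arange(10, 20)    # 10..19

# Stack the two 1D arrays as rows of a 2D array.
z = np.vstack((x, y))

print(z.ndim)    # 2
print(z.shape)   # (2, 10)
print(z)

# Row 1, column 5 in everyday counting is index [0, 4] with zero indexing.
print(z[0, 4])   # 4
```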

14:58 some other constructors for numpy let's

15:14 start with numpy dot ones here I have

15:21 numpy.ones 10

15:24 what does this do it creates an ndarray

15:28 of ones and it gives you the number of

15:31 ones you specify that's pretty

15:34 straightforward

15:41 what about numpy zeros well following

15:45 the previous example you might guess

15:47 gives you the number of zeros you

15:49 specify and actually for both zeros and

15:52 ones if you pass as an argument this

15:57 type of arrangement called a tuple which

16:00 is a list of comma separated numbers

16:02 between parentheses then that will

16:05 determine the shape of the output matrix

16:08 rather than the number of elements in an

16:10 output 1d array in this case I gave it

16:13 the tuple containing 10 comma 10 so it

16:18 made it 10 wide and 10 high there are

16:30 also some constructors called ones_like

16:34 there's also one called

16:36 zeros_like and you pass it an ndarray as

16:39 the constructing argument so I have my

16:42 table of zeros here set as y so what

16:45 happens if I give it ones_like y well it

16:58 gives a matrix or an ndarray of the

17:01 same shape filled in with ones
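
The constructors covered in this stretch can be sketched as follows. The variable names other than `y` are assumptions made for illustration.

```python
import numpy as np

a = np.ones(10)            # 1D array of ten ones
y = np.zeros((10, 10))     # a tuple argument sets the shape: 10 by 10
b = np.ones_like(y)        # same shape as y, filled with ones
c = np.zeros_like(a)       # same shape as a, filled with zeros

print(a.shape, y.shape, b.shape, c.shape)
```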

17:03 often what you'll do is initialize a

17:06 matrix like that then

17:09 to set the values with another further

17:11 function just as an example of how

17:16 broadcasting works with larger nd

17:19 arrays like this I'm just multiplying

17:20 2 right by this ndarray so what

17:25 happens well as you might guess

17:27 multiplying it by two multiplies every

17:29 element in it by two a little more

17:38 broadcasting but

17:53 this time between 2d arrays I initialize

17:57 a 2d array X let's print X it looks like

18:04 this 1 & 2 in the first row 3 & 4 in the

18:07 second row I initialize another ndarray

18:12 called y I use the constructor

18:14 identity which gives you the identity

18:16 matrix if you don't know what that is

18:18 you're about to see it has ones

18:21 diagonally through it and it has the

18:24 property in math that if you do proper

18:26 matrix multiplication and multiply it by

18:29 another matrix of the same size it just

18:31 gives you the matrix you multiply it by

18:33 it's kind of like multiplying one by a

18:35 scalar value you just get that scalar

18:37 value so let's look at what happens when

18:44 we just broadcast them together with a

18:46 multiplier well if we're following what

18:52 we expect from math from matrix

18:53 multiplication then we would just get

18:55 one two three four obviously this is not

18:58 that so broadcasting for this does not

19:01 do what we would straightforwardly

19:02 expect from math rather it does as we

19:05 see element by element product like it

19:08 did for our 1d arrays it gives us 1

19:11 times 1 2 times 0 3 times 0 4 times 1
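
A sketch contrasting the element-by-element product described above with true matrix multiplication, which the speaker turns to next. It assumes `x` was built with `np.array` and that the identity matrix came from `np.identity(2)`.

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])
y = np.identity(2)   # ones on the diagonal, zeros elsewhere

# The * operator multiplies matching elements: 1*1, 2*0, 3*0, 4*1.
elementwise = x * y
print(elementwise)

# Matrix multiplication by the identity returns the original matrix.
product = np.matmul(x, y)
print(product)

# For 2D arrays np.dot gives the same result as np.matmul.
print(np.dot(x, y))

# A scalar is broadcast across every element of a 2D array.
print(2 * np.ones((2, 2)))
```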

19:17 the proper thing to use for 2d arrays is

19:21 numpy.matmul for matrix

19:24 multiplication when we do that we get

19:29 what we expect one two three four

19:30 multiplication by the identity matrix a

19:34 more general mathematical multiplier for

19:40 nd arrays of any dimensionality is

19:44 numpy.dot which for 2d arrays will also do

19:48 matrix multiplication there are many

19:55 other constructors for nd arrays I

19:57 haven't touched on you can read them in

20:01 the numpy documentation if you do any

20:03 further practice I'll give this same

20:05 link again at the end of this

20:06 presentation I'll go over one more here

20:11 which is how to initially how to

20:17 initialize a numpy array from a file on

20:20 your computer because you'll often have

20:21 a table or a text file full of numbers

20:24 that you just want to put right into an

20:26 array here on my desktop I had a file

20:30 called some numbers dot txt

20:33 it was just lists of numbers delimited

20:36 by commas in the first row I had 1 comma

20:41 2 comma 3 in the second I had 4 comma 5

20:43 comma 6 and so on so numpy reads it

20:48 straightforwardly

20:49 we have a 2d array 1 2 3 4 5 6 7 8 9
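
The transcript does not name the function used to read the file, so the following is only a sketch under the assumption that something like `np.loadtxt` with a comma delimiter was used. An in-memory `io.StringIO` buffer stands in for the text file on the speaker's desktop so that the example is self-contained.

```python
import io

import numpy as np

# Stand-in for a comma-delimited text file with three rows of numbers.
text_file = io.StringIO("1,2,3\n4,5,6\n7,8,9\n")

# delimiter="," tells loadtxt that values are separated by commas.
data = np.loadtxt(text_file, delimiter=",")

print(data.shape)  # (3, 3)
print(data)
```

With a real file, the first argument would be the file path instead of the buffer.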

20:57 with practice you'll find eventually

21:00 that you can do pretty complex linear

21:04 algebra and statistics and all sorts of

21:07 great stuff with numpy in addition to

21:09 other modules notable ones are scipy

21:12 and matplotlib which together have graphing

21:16 and statistical utilities that are

21:18 really great and in fact they rely upon

21:22 numpy for many of their methods another

21:26 example of a module used for data

21:31 manipulation is pandas so how does this

21:35 differ from numpy well numpy we saw

21:38 dealt just with numbers and just with

21:42 homogeneous data so what if we want

21:45 something more powerful than pythons

21:47 list object but we don't want to do just

21:50 numbers you'll notice that I never had

21:54 any headers or anything for the nd

21:58 arrays I showed you with numpy we just

22:00 have the numbers and the only

22:02 bookkeeping we were able to do for them

22:04 was to assign them variable names in the

22:07 code itself well often you'll have very

22:10 large tables of text data sometimes text

22:14 mixed with numbers in from column to

22:17 column and for manipulating files like

22:21 that for looking through them quickly

22:22 for aggregating data from such tables or

22:26 grouping them in different ways and

22:28 forwarding and saving only what you want

22:31 pandas is really the go-to utility for

22:35 doing that with Python here another

22:42 import statement import pandas and I

22:44 give it the alias PD and I have a table

22:50 a comma separated value text table on my

22:54 desktop called Stooges which is data

22:57 about the Three Stooges an American

22:59 comedy act from the early to mid 20th

23:02 century so I read it and I output it and

23:07 this pandas.read_csv method it

23:11 outputs an object called the data frame

23:15 and the data frame is kind of to pandas

23:17 what the ndarray is to numpy it's the

23:20 big flexible object that everything sort

23:23 of revolves around so reading the CSV as

23:28 I wrote it well we can see what it looks

23:32 like but we see there are some problems

23:34 in that CSV I did not write a header row

23:41 I just had the data in there so it

23:43 thinks that the first row

23:45 pandas thinks the first row was the

23:48 headers for that and obviously these are

23:53 not headers these are data in addition

23:57 because sometimes the data you get isn't

23:59 perfect I left out a value

24:02 Larry Fine's final appearance here in the

24:05 Stooges and I also put in an incorrect

24:08 value the year 2183 hasn't happened yet

24:12 so that can't be the last time Curly Joe

24:15 DeRita appears in the Three Stooges

24:17 we'll start by looking at that

24:23 header to fix the header it's pretty

24:31 simple we use the same argument to

24:33 import the data to a data frame the same

24:37 method I mean but we give it an

24:39 additional argument called names and to

24:42 names we pass a python list so I want my

24:48 three columns to be labeled stooge for

24:51 their respective comedian first

24:53 appearance for year of first appearance and

24:55 final appearance for the year of final

24:57 appearance let's see how that turns out
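
A sketch of the header problem and the `names` fix. The speaker's CSV is not shown, so the table below is a stand-in: the 1930 debut year, Larry Fine's missing value and the 2183 error come from the talk, while the other years, the column-name spellings and the reading of the third name as Shemp Howard are assumptions.

```python
import io

import pandas as pd

# Placeholder CSV with no header row (see the assumptions above).
raw = (
    "Moe Howard,1930,1970\n"
    "Larry Fine,1930,\n"
    "Shemp Howard,1930,1955\n"
    "Curly Joe DeRita,1958,2183\n"
)

# Without names, the first data row is mistaken for the header.
no_header = pd.read_csv(io.StringIO(raw))
print(no_header.columns.tolist())

# Passing names supplies the column labels and keeps every row as data.
columns = ["stooge", "first_appearance", "final_appearance"]
stooges = pd.read_csv(io.StringIO(raw), names=columns)
print(stooges)
```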

25:01 looks good to me so that worked okay for

25:18 a small table with data you get in science

25:23 you often get very very large tables

25:26 such as lists of 600,000 genetic

25:30 variants each with their own respective

25:32 set of summary statistics so for those

25:36 you don't want to load all those into

25:37 memory at once and display them in a

25:40 window for those you might just want to

25:43 peek at the start or finish of them just

25:48 to make sure the first or last few rows

25:50 look like what you expect them to so

25:53 here I do a command which is associated

25:57 with the data frame object I called

25:59 Stooges called head and I give it the

26:02 number one so it gives the first one

26:04 thing that appears in that data frame

26:08 here I give it the argument tail two you

26:11 might guess what this does based on how

26:13 head behaved it gives the final two values

26:16 in that data frame you can also return

26:22 things other than the start or finish

26:24 sometimes you want things from the

26:26 middle or all throughout so I'm curious

26:30 about that missing value I want to

26:32 know what type of thing it is I want to

26:35 take a closer look at it so let's break

26:38 this down in the first set of brackets

26:42 I specify what value I want from the row

26:47 so I want the date of final appearance

26:50 because that was where the missing value

26:51 was and here to filter for what I want I

26:55 use a boolean value a boolean is a true

27:00 false if you don't know evaluated with

27:01 this equals equals statement so we look

27:05 at Stooges we want to see the stooge and

27:09 the stooge we're looking for is Larry

27:11 Fine so let's enter this and it comes up

27:16 with the correct thing it comes up with

27:18 NaN not a number for the empty

27:21 value it's in the column final

27:24 appearance and its data type it turns

27:26 out is a 64-bit floating-point number so

27:30 it's treated as a decimal
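
The inspection steps described above (`head`, `tail`, and the boolean lookup of the missing value) might look roughly like this. The table is a placeholder: only the 1930 debut year, the missing value and the 2183 error come from the talk, and the column names and remaining values are assumptions.

```python
import pandas as pd

# Placeholder data standing in for the speaker's table.
stooges = pd.DataFrame({
    "stooge": ["Moe Howard", "Larry Fine", "Shemp Howard", "Curly Joe DeRita"],
    "first_appearance": [1930, 1930, 1930, 1958],
    "final_appearance": [1970, float("nan"), 1955, 2183],
})

first_row = stooges.head(1)   # first row only
last_rows = stooges.tail(2)   # last two rows

# First brackets: the column wanted. Second brackets: a boolean filter on rows.
missing = stooges["final_appearance"][stooges["stooge"] == "Larry Fine"]
print(missing)        # shows NaN for the empty value
print(missing.dtype)  # float64
```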

27:36 now let's filter for the starting

27:38 Three Stooges the ones who debuted in

27:40 1930 this is a similar boolean to the

27:44 one I had last time

27:45 except now I'm doing one for multiple

27:47 objects instead of one because multiple

27:51 objects correspond to 1930 the initial

27:55 year since I have nothing in a first set

28:00 of brackets

28:01 it just gives all the values for things

28:03 that match the first appearance 1930 so

28:06 from that I got the original Three

28:07 Stooges Moe Howard

28:09 Larry Fine and Shemp Howard
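
A sketch of the row filter just described, with nothing in a first set of brackets so that whole rows are returned. The table is a placeholder: the 1930 debut year is from the talk, while the column names, the other years and the reading of the third name as Shemp Howard are assumptions.

```python
import pandas as pd

# Placeholder data standing in for the speaker's table.
stooges = pd.DataFrame({
    "stooge": ["Moe Howard", "Larry Fine", "Shemp Howard", "Curly Joe DeRita"],
    "first_appearance": [1930, 1930, 1930, 1958],
    "final_appearance": [1970, float("nan"), 1955, 2183],
})

# The comparison yields one True/False per row; indexing keeps the True rows.
originals = stooges[stooges["first_appearance"] == 1930]
print(originals)
```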

28:13 as I pointed out the year value I had

28:16 for Curly Joe DeRita was quite an error

28:19 so maybe we would give a different sort

28:22 of boolean to filter stuff like that out

28:24 in this case I put final appearance

28:27 because that's where the error occurred

28:28 and I say okay keep it to 20th century

28:32 or earlier because they were a 20th

28:34 century act and not only does that

28:39 filter out Curly Joe with the error put

28:44 in for his final appearance but it also

28:46 filters out the empty value for final

28:50 appearance for Larry Fine because that

28:53 can't be compared to 2000 because it's

28:55 not there so what we end up here with

28:57 here is a set of data that we've

29:01 basically performed quality control

29:03 on and say if we're one stage in a

29:07 multi-stage analysis this data would be

29:10 ready to be passed on to the analysts

29:14 downstream from us so let's do that
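
The quality-control filter described above can be sketched as follows. The cutoff of 2000 and the two bad values are from the talk; the rest of the table and the column names are placeholders.

```python
import pandas as pd

# Placeholder data standing in for the speaker's table.
stooges = pd.DataFrame({
    "stooge": ["Moe Howard", "Larry Fine", "Shemp Howard", "Curly Joe DeRita"],
    "first_appearance": [1930, 1930, 1930, 1958],
    "final_appearance": [1970, float("nan"), 1955, 2183],
})

# Keep rows whose final appearance is in the 20th century or earlier.
# A comparison against NaN is False, so the missing value is dropped too.
clean = stooges[stooges["final_appearance"] <= 2000]
print(clean)
```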

29:20 let's save it to a file to pass to the

29:22 analysts we take the name we assigned the

29:26 data frame and we use the data frame

29:28 function to_csv to write it to something

29:31 I give it a file path I want to call it

29:34 Stooges fixed I put in header equals

29:38 true because I want the header we gave

29:40 it in pandas to be saved to the text

29:42 file itself and I give index equals

29:45 false so that these numbers the indices

29:49 won't be included in that text file

29:51 they're useful in pandas but they're not

29:54 useful for the text file so I put that

29:58 in I'll go to my desktop and we'll take

29:59 a look at it and here's what came out as

30:08 we can see it looks like we would expect
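
A sketch of the save step, writing to a temporary directory rather than the desktop so it can run anywhere. The file name `stooges_fixed.csv` and the table contents are assumptions; `header=True` and `index=False` are the options named in the talk.

```python
import os
import tempfile

import pandas as pd

# Placeholder for the cleaned table produced by the filtering step.
clean = pd.DataFrame({
    "stooge": ["Moe Howard", "Shemp Howard"],
    "first_appearance": [1930, 1930],
    "final_appearance": [1970.0, 1955.0],
})

path = os.path.join(tempfile.mkdtemp(), "stooges_fixed.csv")

# header=True writes the column names; index=False leaves out the row numbers.
clean.to_csv(path, header=True, index=False)

with open(path) as handle:
    saved = handle.read()
print(saved)
```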

30:16 and that's it for my introduction to

30:18 both of those modules I hope these gave

30:21 you an idea of the beginning of what

30:24 they can do but really I've just

30:25 scratched the surface of the full range

30:27 of their capabilities included here are

30:30 links to the introductory pages for both

30:34 of them if you want to learn to use them

30:36 I really encourage going there and

30:38 learning to do so they're very powerful

30:40 very useful once you learn to use them

30:43 if anyone likes I can also email it if you

30:46 want a copy of the code of this

30:47 presentation thank you very much for

30:49 listening I appreciate your attention
