Data manipulation: Python pandas tutorial

Published by Navneet Kishor on

Welcome to my blog post on tutorial of data manipulation with python pandas. In this post I’m going to cover various data manipulation techniques in python pandas data frame.

Pandas itself is a very huge topic and you can’t memorize each and every thing but the best you can do is practice, practice as much as you can and the best way to do so is playing with different datasets of your choice.

In this tutorial, you’ll learn about data manipulation; How to:

  • Sort data
  • Update rows and columns of a data frame
  • Filter out data from data frame
  • Add and remove row(s), column(s) and a data frame

This will also help you to perform sort of find and replace operation.

Before starting this tutorial make sure you already know what is pandas, some basics of pandas such as data frames, index, basic operations etc this will help you learn data manipulation in python pandas.

And if you don’t know these things already then you can check out my tutorial: Get started with pandas: Tutorial for beginners

1. Sorting data in a Pandas Data frame

To start with sorting data, first I’m creating a data frame :

import pandas as pd
df=pd.DataFrame(
data={'First_Name':['Peter','Eli','Martin','Keith'],
      'Last_Name':['Daniel','Stephens','Gibson','Norman'],
      'Age':['17','38','17','27']},
index=['1','2','3','4'])
df

Output:

Creating a small data frame in python pandas

Note: I’ve used Jupyter Notebook throughout this tutorial, so you need to run the code that I’ve provided. You can also use any other environment if you want. The only change you need to do is print the data frame: print(df) instead of just running df. (I hope you already know this because python is a prerequisite for learning python pandas.

So, we have our data frame named df containing first name, last name and age of four people. Here we can either sort data of whole data frame or of just one column i.e., sorting a series.

1.1 Sorting data of whole data frame

a) Sort by one column

To sort data of whole data frame, you need to select the column by which you want the data to be sorted. I will make this more clear with an example:

df.sort_values(by='First_Name',inplace=True)
df

Output:

Sorting data of whole data frame in pandas

Here, our data frame is sorted by First_name column, rest of the column entries are rearranged accordingly. you need to write inplace = True to save the changes.

The same can be sorted in descending order as well using ascending = False:

df.sort_values(by='First_Name',inplace=True,ascending=False)
df

Output:

Sorting data of whole data frame in pandas in descending order.

b) Sort by more than one column

Now, suppose I want to sort my data frame by Age column then you can see there are two same entries in it i.e., 17. So how it is decided which row should come first?

For this kind of situations, you can pass a list of sort by. If the column by which sorting is done has identical entries then second column in sort by list is used to sort those entries.

If you are getting confused about what I’m saying then don’t worry I’ll explain it with an example.

df.sort_values(by=['Age','Last_Name'],inplace=True,ascending=False)

Output:

Sort by multiple column: data manipulation technique in python pandas

In the above example, I’ve passed Age and Last_Name in by list, with ascending =False (i.e., in descending order).

So, at first , data frame is sorted according to Age column and for 17, sorting is done (in descending order) by Last_Name column (For descending order Gibson comes before Daniel).

c) Sort by more than one columns in different orders

Different orders for different columns in sort by list can also be specified using a list (of True and False).

For e.g., let’s say I want my data frame to be sorted by Age column in descending order and when it comes to identical values then sorting should be done by Last_Name column but in ascending order, then below is the code for it:

df.sort_values(by=['Age','Last_Name'],inplace=True,ascending=[False,True])
df

Output:

Sort by multiple columns in different orders:data manipulation technique in python pandas

d) Sort by index

You can sort your data by index as well:

df.sort_index(inplace=True)
df

Output:

Sort by index: data manipulation technique in python pandas

1.2 Sorting data of a series

You can also sort data of a series i.e., one column only. What I mean by that is entries of other columns won’t get affected.

df['First_Name'].sort_values()
df

Output:

2. Filtering out data

In this section I’m going to tell you how you can filter out data based upon conditions you give. First we need to set condition for the filter based upon which data is going to be filter out, then simply apply that filter to the data frame.

For this section, I’m using stack overflow developer survey dataset. You can download it from the link below:

https://insights.stackoverflow.com/survey

Or you can use any dataset of your choice.

Also if want to know details of data frame that I’m using here or how to read in data then check out my previous post Get started with pandas: Tutorial for beginners

First, let’s load our data:

df_survey = pd.read_csv(r'C:\Users\lenovo\Desktop\ml resources\stack_Overflow_Developer_survey_2020\survey_results_public.csv')

Now, you need to set some conditions for filtering out data. For this operators such as “==’ , “<“, “>” etc are used.

It is a two step process:

Step 1: Setting condition for filter

age_filt=(df_survey['Age1stCode']=="16")
age_filt

Output:

In the above line of code I’ve set age_filt to a condition having entries in Age1stCode only equal to 16. It returns a series of boolean data type i.e, True if condition satisfies otherwise False, for each entry in Age1stCode.

Step 2: Applying filter (condition) to the data frame

To apply set condition to our data frame you just need to simply write:

df_survey[age_filt]

On running above code, rows with Age1stCode = 16 will only be printed.

Also if you don’t want the whole data frame but a few columns to be printed with the given condition, then you can specify them:

df_survey[age_filt][['Respondent','Age1stCode']]

Output:

Here I’ve printed only Respondent and Age1stCode columns for the given condition.

You can also use dot notation. It will give you the same output

df_survey.loc[age_filt2,['Respondent','Age1stCode']]

Combining two different conditions

To combine two different conditions we use either:

  • OR operator denoted by ‘|’ , or
  • AND operator denoted by ‘&’ .

For e.g., to print Respondent and Age1stCode column only for developers with Age1stCode lying in between 15 and 25 we have:

age_filt2=((df_survey["Age1stCode"]>'15') & (df_survey["Age1stCode"]<'25'))
df_survey.loc[age_filt2,['Respondent','Age1stCode']]

Output:

Negating a filter

Negating a filter means applying condition opposite to the one which is set. For example negating age_filt2 will set the condition from 15<Age1stCode<25 to Age1stCode<15 and Age1stCode>25.

df_survey.loc[~age_filt2,['Respondent','Age1stCode']]

Output:

Filtering out rows with particular entries

You can set particular values to the filter instead of conditions. Doing this will filter out rows having one of these values in one of the columns.

age=['13','15','18']
filt=df_survey['Age1stCode'].isin(age)         #Setting filter
df_survey.loc[filt,['Respondent','Age1stCode']]  #Applying filter

Output:

Filter using string method

In this method, we specify a particular string value for a column and all the rows with that string value in the specified column filter out.

For e.g., if I want to filter out all those entries in LanguageWorkedWith column which contain “Python” as one of its string values, then I’ll write:

filt2=df_survey['LanguageWorkedWith'].str.contains('Python',na=False)   

Note: na=False is used to eliminate Null values.

Now let’s apply this filter and print Respondent and LanguageWorkedWith columns:

df_survey.loc[filt2,['Respondent','LanguageWorkedWith']]

Output:

Note: String method is different from normal filter. If you’ll apply normal filter for printing LanguageWorkedWith = Python then entries containing Python only will be printed.

filt1=df_survey['LanguageWorkedWith']=='Python'                       #Normal method
df_survey.loc[filt1,['Respondent','LanguageWorkedWith']]

Output:

3. Updating Data

In this section you’ll learn how to update data in a data frame. You can change: column header, row entries or change a particular entry.

3.1 Updating column header

For updating data section I’m going to use data frame named df that I’ve created earlier in section 1. Here is our data frame df

Now, I’m changing its column headers to ‘First’,’Last’ and ‘age’ using df.columns:

df.columns=['First','Last','age']
df

Output:

You can convert column headers to upper case using list comprehension :

df.columns=[x.upper() for x in df.columns]
df

Output:

You can also replace any particular character with the one you want in all column headers. For e.g., suppose I want ‘T’ in every column headers to be replaced with ‘t’ , so to do so , I need to use str.replace:

df.columns=df.columns.str.replace('T','t')
df

Output:

3.2 Rename column header

If you want some of the column headers to be changed instead of changing all column headers then you need to use rename, inside which a dictionary named columns is passed. This dictionary has key as old column name (which you want to change) and value as new column header.

e.g., renaming FIRSt as First_name and LASt as last_name:

df.rename(columns={'FIRSt':'First_name','LASt':'Last_name'},inplace=True)
df

Output:

3.3 Updating row values

If you look back at the top, in section 1, I’ve created data frame df with customized index values. And to update row values I’m gonna use dot notation and to use that for updating purpose, your data frame should have default index. So I’m doing that here using reset and drop

df.reset_index(inplace=True)
df.drop(columns='index', inplace=True)
df

Output:

Now, we’re all set to proceed.

I’m going to change Eli Stephens 38 to Sophie Ellis 13. For that I just need to assign a list to df.loc, inside loc index where I want my values to be updated is passed.

df.loc[2]=['Sophie','Ellis','13']
df

There are three columns in the data frame and so the first element in the list corresponds to first column, second element to the second column and so on.

Output:

Updating all values in a row:data manipulation technique in python pandas

If you ask can I only update last name and age of a person instead of whole row? then my answer is yes.

If you remember how we can access row using .loc then you must know that row with specific columns can be accessed.

Here you need to do the same; grab a row with specific column and assign it a list of new valus for those columns:

df.loc[2,['Last_name','AGE']]=['Kale','23']
df

output:

Updating few row values:data manipulation technique in python pandas

Updating a single value

For updating a single value you can use .at or .loc, both will do the same thing:

df.loc[2,'AGE']='33'
df

Or

df.at[2,'AGE']='33'
df

Output:

Updating a single value:data manipulation technique in python pandas

3.4 Update values with a condition

Here, you can update values which satisfy the given condition. First you need to filter out data for a particular condition and then update it with the new data.

Let’s say I want to change last name of all the people who are below 30 to Smith. Then to do so first I need to select rows of those people who are below 30 by applying filter and then update their Last_name column value to Smith.

filt=(df['AGE']<'30')
df.loc[filt,'Last_name']='Smith'
df

Output:

Update value in a data frame with a condition:data manipulation technique in python pandas

4. Add/Remove Row, Column and data frame to a data frame

In this section you’ll learn how to add or remove a row or column to a data frame, also how to add a data frame to another data frame.

4.1 Add column(s)

To add a column to data frame, you first need one. You can either create a new column by combining columns which are already present in the data frame or you can simply create one with passing values in the list. I’ll show you both ways one by one.

a) By combining two columns of data frame

First let’s create and a new column Full_name by combining First_name and Last_name of our data frame:

df['Full_name']=df['First_name']+" "+df['Last_name']
df

Output:

Add a new column:data manipulation technique in python pandas

b) By creating a new column

Now, let’s create a new column by passing values of our choice in the data frame:

df['Id_No.']=['001','003','002','005']
df

Output:

Add a new column:data manipulation technique in python pandas

c) By expanding a column in data frame

In this, we just expand a pre-existing column by splitting it over a character or space (generally a space) and make two columns out of it that are added our data frame.

Before doing that I’m going to delete First_name and Last_name columns:

df.drop(columns=["First_name","Last_name"], inplace=True)
df

Output:

Now, next step is to split Full_name over space:

df['Full_name'].str.split(" ", expand = True)

Output:

Here, expand = True is storing the two split values in two different columns. Remember, this doesn’t do anything to Full_name column in data frame and you need to store this expanded column to two new columns.

df[['First_name','Last_name']] = df['Full_name'].str.split(" ", expand = True)
df

Output:

Add columns by expand ing a pre-existing column in data frame:data manipulation technique in python pandas

4.2 Delete column(s)

To delete any column we use drop, inplace = True is used to save the changes in the data frame.

df.drop(columns=["First_name","Last_name"], inplace=True)
df

Output:

Delete column(s) :data manipulation technique in python pandas

4.3 Add row(s)

To add new row you need to use dot notation. You need to pass the index inside loc and remember here if you pass index which is already there in the data frame then it is going to be replaced by new row instead of adding. So be careful while adding a new row. But this is the case for default indices and for customized ones you can put in existing row label inside loc and it won’t replace the one which is already there.

(Here we have default indices).

df.loc[4]=['13','Sarah Williams','004']
df

Output:

Add row(s) :data manipulation technique in python pandas

4.4 Add a data frame

We can also add a adata frame to the one that we have. For that we can have a data frame either with completely different columns (new columns will be added) or some common columns then how our new data frame will look like we’ll see that here.

In order to add a data frame, I’m creating one :

df1=pd.DataFrame(
data={'First_name':['Samuel','Joseph'],
      'Last_name':['Smith','Taylor'],
      'Age':['12','21']},
index=['1','2'])
df1

Output:

Creating a new data frame.

Now, you are ready to add df1 to df

df=df.append(df1, ignore_index=True,sort=False)
df

Output:

Add a data frame to another:data manipulation technique in python pandas

Here you can see that df and df1 only have First_name, Last_name columns in common so for AGE, Full_name, Id_No. We have null values at index 5 and 6.

4.5 Delete row(s)

To delete a row, drop is used along with the index to be deleted.

df.drop(index=6,inplace = True)
df

Output:

Delete row(s) :data manipulation technique in python pandas

Deleting a row using a condition

Let’s set a condition in order to delete a row, i.e., the entry which satisfies a given condition, its index value will be passed to the index and the whole row containing that entry will be deleted.

Let’s say I want to delete row which contains Smith in its Last_name column.

filt=(df['Last_name']=='Smith')
df.drop(index=df[filt].index, inplace = True)
df

Output:

Delete a row using conditions:data manipulation technique in python pandas

With this I’m ending my tutorial on data manipulation with python pandas. I hope the concepts and code is clear to you. If you are not clear with something or you have any doubts then do let me know in the comments section. Please feel free to ask your questions, I’ll be happy to help you.

Categories: Blog

2 Comments

Get started with Python Pandas:Tutorial for beginners - YouthSuccesso · August 16, 2020 at 12:05 pm

[…] Also if you want to learn about data manipulation with pandas then you can check out my post Data Manipulation: Python pandas tutorial […]

Matplotlib vs Seaborn: A Guide for Beginners - YouthSuccesso · August 16, 2020 at 12:18 pm

[…] with pandas then click here. Also to learn about data manipulation using python check out our post data manipulation using […]

Leave a Reply

Your email address will not be published. Required fields are marked *