Get started with Python Pandas: Tutorial for beginners

Published by Navneet Kishor on

Want to become a data scientist or data analyst but don't know where to start? If you have learned Python, enjoy writing code in it, and want to get into data analysis or data science, then pandas is a must for you. This tutorial is for beginners who want to get started with Python pandas. By the end of it you'll have an overview of the pandas basics.

In this tutorial I've included all the basic concepts you need as a beginner to start working with Python pandas. The topics covered in this post are:

  • What is Python Pandas?
    • Why use Python Pandas
  • Getting started with Pandas
    • Installing and importing pandas
  • Data frames and Series in Python Pandas
    • What is a Pandas Data frame
      • Creating a data frame
    • What is a Pandas Series
      • Creating a series
  • Reading data
  • Basic operations of pandas data frame
    • Viewing the data
      • The whole data frame
      • First few rows
      • Shape
      • Describe
    • Accessing column(s)
    • Accessing row(s)
      • Indexes of a data frame
      • iloc
      • loc

What is Python Pandas?

Pandas is an open source library for the Python programming language, originally developed by Wes McKinney, that provides a number of useful data analysis and manipulation tools. Pandas is built on top of NumPy and is used to manipulate tabular data and time series.

Why use Python Pandas?

Pandas can read data from many different file formats, such as CSV, JSON, HTML and XML, so we can load and analyse files of different formats with the same tools. Commonly used formats that pandas can read directly include:

  • Plain text (.txt), delimited or fixed width
  • Comma Separated Values (CSV)
  • JSON
  • HTML (tables)
  • XML
  • XLSX
  • ZIP (a compressed CSV file)
  • SQL
  • Hierarchical Data Format

Formats such as PDF, DOCX, images, MP3 and MP4 cannot be read by pandas on its own; they need other libraries to extract the data first.

Beyond that, pandas performs well because it is built on top of NumPy, and it is one of the most downloaded packages in Python. It supports a number of operations such as slicing a data frame, concatenation and changing the index; we'll see how to do some of these in later sections.

Getting started with the basics of Pandas

Before getting started with pandas, I would suggest going through some basics of Python's NumPy library if you are not already familiar with it. I am also assuming that you already have a fair knowledge of the Python programming language, because that is a prerequisite for this tutorial. Now we are ready to move forward, so let's get started.

Installing and importing pandas

Installing the pandas package is an easy task. Open the terminal (on Mac) or the command prompt (on Windows) and use the following command to install pandas:

pip install pandas 

Or, if you have Anaconda installed on your system, you can run the following command in the Anaconda PowerShell prompt:

conda install pandas

You can also run the following command in a Jupyter notebook cell:

!pip install pandas

Once pandas is installed, the next step is to import it in whatever environment you are using. I will be using Jupyter notebook throughout this tutorial and would recommend that you use it too. In a Jupyter notebook you can see the output of every cell you run, which is helpful for a beginner: you can see what is going on at each step and get a better picture of your data.

Now, to import pandas, run the following in your notebook cell:

import pandas as pd
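
As a quick check that the installation and import worked, a minimal sketch is to print the installed version. The variable name installed_version is only for illustration, and the version you see will depend on your own installation:

```python
import pandas as pd

# __version__ is a plain string; the exact value depends on your installation
installed_version = pd.__version__
print(installed_version)
```

If this runs without an ImportError, pandas is ready to use.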

Data frames and Series in Python Pandas

Before going further in this tutorial you first need to know what different types of data structures are available in pandas.

Older versions of pandas had three data structures:

  1. Data frame
  2. Series
  3. Panel

The two most used data structures in pandas are data frames and series, and these are what you'll be learning about in this tutorial.

Panel was used for storing three dimensional data. It was deprecated and then removed in pandas 0.25, so in current versions you only work with data frames and series.

Note that the pandas data structures are value mutable. A series is size immutable, whereas a data frame is size mutable.
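
To make the terms concrete, here is a minimal sketch of what value mutable and size mutable look like in practice. The names marks and scores and their values are made up for illustration:

```python
import pandas as pd

# Value mutable: an existing element of a series can be overwritten in place
marks = pd.Series([10, 20, 30])
marks.iloc[0] = 99

# Size mutable: a data frame can grow, here by adding a new column
scores = pd.DataFrame({"maths": [70, 80, 90]})
scores["science"] = [65, 75, 85]

print(marks.tolist())   # [99, 20, 30]
print(scores.shape)     # (3, 2)
```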

What is a Pandas Data frame?

A data frame is a two dimensional data structure made up of rows and columns, much like a table in a spreadsheet.

[Figure: an example of a Pandas DataFrame]

Creating a Data frame

If you remember how a string or a list can be stored against a key in a Python dictionary, then you can understand the following code:

students={
    'Roll_No.':'1',
    'Name':'Erie',
    'Age':'12'
}
students

As I said before, you can run the above code in any environment (note: you need to use print(students) in place of students if you're not using Jupyter), but I would suggest running it in a Jupyter notebook cell. The output will be the same as in the figure below.

Output:

[Figure: a Python dictionary with one value per key]

So, we have a dictionary with the keys Roll_No., Name and Age. Now, if we have data for more than one student and want to create a data frame, we can store a list of values under each key and convert the dictionary into a data frame:

students={
    'Roll_No.':['1','2','3'],
    'Name':['Erie','Esha','Peter'],
    'Age':['12','12','13']
}
students

Output:

[Figure: a Python dictionary with multiple values per key]

df_stds=pd.DataFrame(students)
df_stds

Output:

[Figure: the Pandas data frame created from the dictionary]

We have created our own data frame here, containing the roll number, name and age of three students. Now let's see what a pandas series is and how to create one.
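
One thing to notice is that the roll numbers and ages in the dictionary above are written as strings. As a hedged variation (the names students_numeric and df_numeric are my own, not part of the original example), storing them as integers lets pandas treat those columns as numbers, so calculations such as an average work directly:

```python
import pandas as pd

# Same students as before, but roll numbers and ages are integers, not strings
students_numeric = {
    'Roll_No.': [1, 2, 3],
    'Name': ['Erie', 'Esha', 'Peter'],
    'Age': [12, 12, 13],
}
df_numeric = pd.DataFrame(students_numeric)

# dtypes shows the data type pandas picked for each column
print(df_numeric.dtypes)

# mean() works because Age is now a numeric column
average_age = df_numeric['Age'].mean()
print(average_age)
```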

What is a Pandas Series?

A series can be defined as a labelled one dimensional array. Put simply, if you pick out a single column from a data frame, you get a series.

[Figure: a Pandas series]

Creating a Series

A series can be formed using lists.

name=['Erie','Esha','Peter']
sr_name=pd.Series(name)
sr_name

Output:

[Figure: the Pandas series created from the list]
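
Since a series is a labelled array, you can also supply your own labels instead of the default 0, 1, 2. The sketch below assumes made-up names (sr_age, df_small) and also shows that a single column taken from a data frame is a series:

```python
import pandas as pd

# Custom labels instead of the default 0, 1, 2
sr_age = pd.Series([12, 12, 13], index=['Erie', 'Esha', 'Peter'], name='Age')
print(sr_age['Peter'])   # 13

# A single column picked from a data frame is also a series
df_small = pd.DataFrame({'Name': ['Erie', 'Esha', 'Peter'], 'Age': [12, 12, 13]})
name_column = df_small['Name']
print(type(name_column))
```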

Reading data

In pandas, data can be loaded from a number of file formats, which I have already discussed. For this tutorial I've loaded data from a CSV file, using the Stack Overflow developer survey data. Below is the link from where you can download it, or you can use any other data if you want.

https://insights.stackoverflow.com/survey

Open the above link and click on download full data set to download the CSV file. You can download any data set from the list; I have used the 2020 data set.

Once the download finishes, you can see that it has two CSV files:

  • survey_results_public, which contains 64461 responses from different people to a total of 60 questions.
  • survey_results_schema, which has 61 rows and 2 columns. The first column contains the column names of survey_results_public, and the second column explains what the corresponding column means.

This will become clearer once you can view the data frame in Jupyter notebook, but before that you need to read the data from the CSV file.

To read data from a CSV file into a data frame, you need to run the following code:

df = pd.read_csv('survey_results_public.csv')

or you can pass pd.read_csv() the full path to where your CSV file is stored:

df = pd.read_csv(r'C:\Users\lenovo\Desktop\ml resources\stack_Overflow_Developer_survey_2020\survey_results_public.csv')
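
If you want to try read_csv before downloading the survey, here is a small sketch that reads CSV text from memory instead of a file. The sample rows are invented for illustration and are not taken from the survey; usecols and nrows are optional read_csv parameters that can help when a file is large:

```python
import io
import pandas as pd

# Made-up sample text standing in for a CSV file on disk
csv_text = """Respondent,Hobbyist,Age
1,Yes,25
2,No,31
3,Yes,
"""

# usecols keeps only the named columns; nrows stops after that many data rows
sample = pd.read_csv(io.StringIO(csv_text), usecols=['Respondent', 'Age'], nrows=2)
print(sample)
```

The same two parameters can be passed when reading survey_results_public.csv from disk.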

Basic operations of Pandas Data frame

In this section you'll be learning about various basic operations that we can perform on data frames.

1. Viewing the data frame

1.1 The whole data frame

Let's see what our data frame looks like. Just run df (the file data is loaded in a data frame named df) in a cell:

df

If you run the above code in your Jupyter notebook, you can see in the output that some columns in the middle are skipped, shown by three dots. But what if you want to see all the columns by scrolling? Don't worry, you can do that with the help of the following code:

pd.set_option('display.max_columns',61)  

Here the maximum number of columns is set to 61 because there are 61 columns in total. Now run df again and all the columns will be visible. Similarly, you can use set_option to make all the rows visible as well:

pd.set_option('display.max_rows',64461)
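
As a hedged alternative to hard-coding 61, the display limit can be removed with None, restored with reset_option, or changed only temporarily with option_context. This is a sketch of those three calls; the variable names are just for illustration:

```python
import pandas as pd

rows_before = pd.get_option('display.max_rows')

# None removes the limit, so every column is shown
pd.set_option('display.max_columns', None)
columns_limit = pd.get_option('display.max_columns')

# reset_option restores the default for that one setting
pd.reset_option('display.max_columns')

# option_context changes a setting only inside the with block
with pd.option_context('display.max_rows', 10):
    rows_inside = pd.get_option('display.max_rows')

rows_after = pd.get_option('display.max_rows')
print(columns_limit, rows_inside, rows_before == rows_after)
```

Using option_context is handy when you want a wider view for one cell without changing how every later output is displayed.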

1.2 First few rows

You can view the first five rows of a data frame:

df.head()

Or you can specify how many rows you want to view. Let's say I want to view the first three rows (you can pass in any number):

df.head(3)

Output:

[Figure: first three rows of the data frame shown by head]

You can also view the last five rows of a data frame:

df.tail()

Or you can specify the number of rows here as well:

df.tail(2)

Output:
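
To try head and tail without the survey file, here is a minimal sketch on a small made-up data frame (the name numbers and its single column n are assumptions for illustration):

```python
import pandas as pd

# Ten rows holding the values 0 to 9
numbers = pd.DataFrame({'n': range(10)})

first_three = numbers.head(3)   # rows at positions 0, 1, 2
last_two = numbers.tail(2)      # rows at positions 8, 9

print(first_three)
print(last_two)
```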

1.3 Shape

The shape of a data frame is a tuple containing the number of rows and columns in the data frame:

df.shape

Output:

(64461, 61)

Note: Write df.shape, not df.shape(), because shape is an attribute, not a method.
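
Because shape is a plain tuple, it can be unpacked into two variables. The sketch below uses a made-up data frame named small and also shows what happens if you add parentheses by mistake:

```python
import pandas as pd

small = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Unpack the tuple into row and column counts
n_rows, n_cols = small.shape
print(n_rows, n_cols)   # 3 2

# Calling it like a method fails, because a tuple is not callable
try:
    small.shape()
    called_ok = True
except TypeError:
    called_ok = False
print(called_ok)   # False
```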

1.4 Describe

The describe() method calculates summary statistics such as count, mean, standard deviation, minimum and maximum for the columns (by default, those with numeric values) of a data frame.

Parameters:

  • include: list of data types to include. Default value is None.
  • exclude: list of data types to exclude. Default value is None.
  • percentiles: list of numbers between 0 and 1 specifying which percentiles to return. Default value is [.25, .5, .75].

df.describe(percentiles=[0.22,0.33],exclude=None,include=None)

Output:
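
Here is a small sketch of how these parameters change the result, using a made-up data frame called people with one text column and one numeric column. Passing include='all' is one way to bring non-numeric columns into the summary:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Erie', 'Esha', 'Peter'],
    'Age': [20, 30, 40],
})

# Default: only the numeric column (Age) is summarised
numeric_summary = people.describe()

# Custom percentiles appear as extra rows labelled 10% and 90%
custom_summary = people.describe(percentiles=[0.1, 0.9])

# include='all' also summarises the text column
full_summary = people.describe(include='all')

print(numeric_summary)
print(custom_summary.index.tolist())
print(full_summary.columns.tolist())
```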

2. Accessing Column(s)

You can access a single column of a data frame. Let’s say we want to access the Age column:

df['Age']

Or

df.Age

Output:

You can also check the type of df.Age, which will show you that it is a series:

type(df.Age) 
#or
type(df['Age']) 

Output:

You can also access more than one column at a time:

df[['Age','Hobbyist']]

Output:
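
The difference between single and double brackets is easy to miss, so here is a minimal sketch on a made-up data frame (df_demo is a hypothetical name). It also shows a case where the dot style such as df.Age cannot be used, because the column name is not a valid Python name:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Roll_No.': [1, 2, 3],
    'Age': [12, 12, 13],
    'Hobbyist': ['Yes', 'No', 'Yes'],
})

single = df_demo['Age']         # one pair of brackets: a Series
as_frame = df_demo[['Age']]     # a list of names: a DataFrame, even with one name
roll = df_demo['Roll_No.']      # the name contains a dot, so only brackets work here

print(type(single).__name__, type(as_frame).__name__)
```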

3. Accessing row(s)

Row(s) of a data frame can be accessed using either iloc or loc. Both are used with square brackets, not parentheses.

  • iloc : in iloc, rows are selected by position number.
  • loc : in loc, rows are selected by index name or label.

Indexes of a data frame

Before going into the details of these two, I will give a brief introduction to the index name or label, because it is easy to confuse iloc and loc without a clear picture of what the default index, label, position number etc. are.

For a given data set we have:

  1. Position number
  2. Index name / Label

Position number :

For any row or column, the position number is its integer location. It starts from zero and goes up to the total number of rows/columns minus one.

For example, in the data frame of the survey_results_public dataset, the position number of the first column (i.e., Respondent) is 0, the second column (i.e., MainBranch) is 1, the third column (i.e., Hobbyist) is 2 and so on. Similarly, for rows it is 0, 1, 2, ..., 64460 (the total number of rows is 64461).

Index :

If you look at the data frame, there are various columns named Respondent, Age, Hobbyist etc. These are the column headers, and technically they are an index too. Let me show you how.

df.columns

Output:

So here you can see that all the column names are stored inside an Index object. It is a special pandas object type, not a plain list or tuple.

Coming next to the row labels, or the index: you can see the numbers 0, 1, 2 and so on at the left of the data frame. These are the row labels.

df.index

Output:

These are the default indexes, and you can customize them as well. Let's say I want the Respondent column as my index (before that, it is a good idea to make sure each entry in the column is unique; pandas does allow duplicate row labels, but they make lookups by label ambiguous):

df.set_index('Respondent',inplace=True)
df.head()

Output:
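
The same ideas can be checked on a tiny made-up data frame (df_demo and indexed are hypothetical names). This sketch also shows that set_index without inplace=True returns a new data frame and leaves the original unchanged, and that is_unique is one way to check a column before using it as the index:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Respondent': [1, 2, 3],
    'Hobbyist': ['Yes', 'No', 'Yes'],
})

print(type(df_demo.columns))   # the column names live in an Index object
print(df_demo.index)           # the default row labels: 0, 1, 2

# Check that the values are unique before using them as row labels
unique_ids = df_demo['Respondent'].is_unique

# Without inplace=True, set_index returns a new data frame
indexed = df_demo.set_index('Respondent')
print(indexed.index.name)
print(list(df_demo.columns))   # original still has Respondent as a column
print(list(indexed.columns))   # Respondent has moved to the index
```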

So, I hope that you're now clear about indexes. We can continue with iloc and loc.

iloc:

Here you pass the position number of the row you want. For example, for the first row:

df.iloc[0]

You can access multiple rows by passing a list of their position numbers. You can also specify the columns if you want:

df.iloc[[0,1],[0,2]]

Output:

In the above example, there are two inner lists inside the iloc brackets. The first one is for row positions (the 1st row has position 0, the 2nd has 1 and so on). The second one is for column positions.

NOTE: Positions 0 and 2 give us the MainBranch (not Respondent) and Age columns in the above example, because Respondent is no longer a regular column; it is now the index.

loc:

loc is similar to iloc, but you pass row labels and column names in place of position numbers. For example (here the row labels 1 and 2 are Respondent values, since Respondent is now the index):

df.loc[[1,2],['Hobbyist','Age']]

Output:
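
The difference between the two is easiest to see when the row labels are not 0, 1, 2. Below is a minimal sketch on a made-up data frame whose labels are 10, 20 and 30 (df_demo and its values are assumptions for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Hobbyist': ['Yes', 'No', 'Yes'], 'Age': [25, 31, 40]},
    index=[10, 20, 30],
)

by_position = df_demo.iloc[0]    # first row, whatever its label is
by_label = df_demo.loc[10]       # row whose label is 10

# Slices differ: iloc excludes the end position, loc includes the end label
position_slice = df_demo.iloc[0:2]
label_slice = df_demo.loc[10:20]

print(by_position.equals(by_label))
print(position_slice.index.tolist(), label_slice.index.tolist())
```

Here df_demo.iloc[0] and df_demo.loc[10] point at the same row, while df_demo.iloc[10] would fail because there is no eleventh row.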

That's it for this tutorial. I hope the concepts and the code are clear to you, as this is a simple and quick introduction to Python pandas. If you have any doubts, feel free to ask your questions in the comments section; I'll be very happy to help you.

Also, if you want to learn about data manipulation with pandas, you can check out my post Data Manipulation: Python pandas tutorial.
