
Describing a DataFrame


One of the first things you'll want to do after importing data into a pandas DataFrame is to start exploring it. Pandas has many built-in functions that let you quickly get information about a DataFrame.

Let's explore some of them using the car_details DataFrame.

import pandas as pd
car_details = pd.DataFrame({ "Make"  : pd.Series(["Toyota", "Toyota", "Nissan","Honda", "Toyota"]),
                             "Colour": pd.Series(["White", "Blue", "White","Blue", "White"]),
                             "Odometer (KM)": pd.Series([150043, 32549, 213095, 45698, 60000]),
                             "Doors" : pd.Series([4, 3, 4, 4, 4]),
                             "Price" : pd.Series(["$4,000.00", "$7,000.00", "$3,500.00","$7,500.00", "$6,250.00"]) })
print(car_details)

Output:

     Make Colour  Odometer (KM)  Doors      Price
0  Toyota  White         150043      4  $4,000.00
1  Toyota   Blue          32549      3  $7,000.00
2  Nissan  White         213095      4  $3,500.00
3   Honda   Blue          45698      4  $7,500.00
4  Toyota  White          60000      4  $6,250.00

.dtypes shows us what datatype each column contains.

print(car_details.dtypes)

Output:

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price            object
dtype: object
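As a small sketch of how these datatypes can be put to use, select_dtypes() filters columns by dtype. The DataFrame below is a trimmed copy of car_details, and the name numeric_columns is only illustrative.

```python
import pandas as pd

# Trimmed copy of the car_details DataFrame from above
car_details = pd.DataFrame({
    "Make": ["Toyota", "Toyota", "Nissan", "Honda", "Toyota"],
    "Odometer (KM)": [150043, 32549, 213095, 45698, 60000],
    "Price": ["$4,000.00", "$7,000.00", "$3,500.00", "$7,500.00", "$6,250.00"],
})

# Keep only the columns with a numeric dtype (int64 here)
numeric_columns = car_details.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Price is left out because its values are stored as text (for example "$4,000.00"), not as numbers.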

.describe() gives you a quick statistical overview of the numerical columns.

print(car_details.describe())

Output:

       Odometer (KM)     Doors
count       5.000000  5.000000
mean   100277.000000  3.800000
std     78090.879483  0.447214
min     32549.000000  3.000000
25%     45698.000000  4.000000
50%     60000.000000  4.000000
75%    150043.000000  4.000000
max    213095.000000  4.000000
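By default the text columns (Make, Colour, Price) are left out of this overview. A minimal sketch, assuming you also want a summary of those columns, is to pass include="all". The DataFrame below is a trimmed copy of car_details, and full_summary is an illustrative name.

```python
import pandas as pd

# Trimmed copy of the car_details DataFrame from above
car_details = pd.DataFrame({
    "Make": ["Toyota", "Toyota", "Nissan", "Honda", "Toyota"],
    "Colour": ["White", "Blue", "White", "Blue", "White"],
    "Doors": [4, 3, 4, 4, 4],
})

# include="all" summarises every column, not just the numeric ones.
# Text columns get count/unique/top/freq; numeric columns get mean, std, etc.
full_summary = car_details.describe(include="all")
print(full_summary)
```

Cells that do not apply to a column (such as the mean of Make) show up as NaN in the result.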

.info() shows several useful pieces of information about a DataFrame, such as:

How many entries (rows) there are

Whether there are missing values (if a column's non-null count is less than the number of entries, it has missing values)

The datatypes of each column

car_details.info()

Output:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Make           5 non-null      object
 1   Colour         5 non-null      object
 2   Odometer (KM)  5 non-null      int64
 3   Doors          5 non-null      int64
 4   Price          5 non-null      object
dtypes: int64(2), object(3)
memory usage: 328.0+ bytes
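The car_details DataFrame has no missing values, so every column reports 5 non-null. To illustrate the missing-value check described above, here is a sketch using a small hypothetical DataFrame (cars_with_gaps is invented for this example and is not part of the original data).

```python
import pandas as pd

# Hypothetical data with two missing entries, for illustration only
cars_with_gaps = pd.DataFrame({
    "Make": ["Toyota", None, "Nissan"],
    "Doors": [4, 3, None],
})

# count() gives the non-null count per column, as reported by .info()
non_null_counts = cars_with_gaps.count()

# isna().sum() counts the missing values per column directly
missing_per_column = cars_with_gaps.isna().sum()

print(non_null_counts)
print(missing_per_column)
```

Each column here has 3 entries but only 2 non-null values, so each has one missing value.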

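You can also call various statistical and mathematical methods such as .mean() or .sum() directly on a DataFrame or Series. For .mean() on a DataFrame, pass numeric_only=True so that text columns such as Make are skipped; recent pandas versions raise a TypeError otherwise.

print(car_details.mean(numeric_only=True))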

Output:

Odometer (KM)    100277.0
Doors                 3.8
dtype: float64

Calling .mean() on a Series

car_prices = pd.Series([3000, 3500, 11250])
print(car_prices.mean())

Output:

5916.666666666667

Calling .sum() on a DataFrame

print(car_details.sum())

Output:

Make                             ToyotaToyotaNissanHondaToyota
Colour                                 WhiteBlueWhiteBlueWhite
Odometer (KM)                                           501385
Doors                                                       19
Price            $4,000.00$7,000.00$3,500.00$7,500.00$6,250.00
dtype: object

Calling .sum() on a Series

car_prices = pd.Series([3000, 3500, 11250])
print(car_prices.sum())

Output:

17750

Calling these on a whole DataFrame may not be as helpful as targeting an individual column. But it's helpful to know they're there.
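Targeting an individual column can be sketched as follows. The DataFrame is a trimmed copy of car_details, and the names mean_odometer and total_doors are illustrative.

```python
import pandas as pd

# Trimmed copy of the car_details DataFrame from above
car_details = pd.DataFrame({
    "Odometer (KM)": [150043, 32549, 213095, 45698, 60000],
    "Doors": [4, 3, 4, 4, 4],
})

# Selecting a column with [] returns a Series, which has the same methods
mean_odometer = car_details["Odometer (KM)"].mean()
total_doors = car_details["Doors"].sum()

print(mean_odometer)  # 100277.0
print(total_doors)    # 19
```

These match the per-column values that the whole-DataFrame calls above produced.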

.columns will show you all the columns of a DataFrame.

print(car_details.columns)

Output:

Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

.index will show you the values in a DataFrame's index (the column on the far left).

print(car_details.index)

Output:

RangeIndex(start=0, stop=5, step=1)

len() returns the length of a DataFrame (its number of rows).

print(len(car_details))

Output:

5
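If you also need the number of columns, a closely related attribute is .shape, which returns a (rows, columns) tuple. A brief sketch, with a trimmed copy of car_details and illustrative variable names:

```python
import pandas as pd

# Trimmed copy of the car_details DataFrame from above
car_details = pd.DataFrame({
    "Make": ["Toyota", "Toyota", "Nissan", "Honda", "Toyota"],
    "Doors": [4, 3, 4, 4, 4],
})

# .shape is an attribute, not a method, so there are no parentheses
n_rows, n_columns = car_details.shape
print(n_rows, n_columns)  # 5 2
```

The first element of .shape is the same number that len() returns.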
