Describing DataFrame

You've seen an example of one way to manipulate data but pandas has many more. How many more? Put it this way, if you can imagine it, chances are, pandas can do it.

Let's start with string methods. Because pandas is based on Python, however you can manipulate strings in Python, you can do the same in pandas.

You can access the string value of a column using .str. Knowing this, how do you think you'd set a column to lowercase?

import pandas as pd
car_details = pd.DataFrame({ "Make"  : pd.Series(["Toyota", "Toyota", "Nissan","Honda", "Toyota"]),
                             "Colour": pd.Series(["White", "Blue", "White","Blue", "White"]),
                             "Odometer (KM)": pd.Series([150043, 32549, 213095, 45698, 60000]),
                             "Doors" : pd.Series([4, 3, 4, 4, 4]),
                             "Price" : pd.Series(["$4,000.00", "$7,000.00", "$3,500.00","$7,500.00", "$6,250.00"]) })
print(car_details)

Output:

   Make    Colour  Odometer (KM)  Doors     Price
0  Toyota  White         150043      4  $4,000.00
1  Toyota   Blue          32549      3  $7,000.00
2  Nissan  White         213095      4  $3,500.00
3   Honda   Blue          45698      4  $7,500.00
4  Toyota  White          60000      4  $6,250.00

Lower the Make column

print(car_details["Make"].str.lower())

Output:

0    toyota
1    toyota
2    nissan
3     honda
4    toyota
Name: Make, dtype: object

Notice how it doesn't change the values of the original car_sales DataFrame unless we set it equal to.

print(car_details)

Output:

     Make Colour  Odometer (KM)  Doors      Price
0  Toyota  White         150043      4  $4,000.00
1  Toyota   Blue          32549      3  $7,000.00
2  Nissan  White         213095      4  $3,500.00
3   Honda   Blue          45698      4  $7,500.00
4  Toyota  White          60000      4  $6,250.00

let do it now

car_details["Make"] = car_details["Make"].str.lower()
print(car_details)

Output:

     Make Colour  Odometer (KM)  Doors      Price
0  toyota  White         150043      4  $4,000.00
1  toyota   Blue          32549      3  $7,000.00
2  nissan  White         213095      4  $3,500.00
3   honda   Blue          45698      4  $7,500.00
4  toyota  White          60000      4  $6,250.00

Reassigning the column changes it in the original DataFrame. This trend occurs throughout all kinds of data manipulation with pandas.

Some functions have a parameter called inplace which means a DataFrame is updated in place without having to reassign it.

Let's see what it looks like in combination with .fillna(), a function which fills missing data. But the thing is, our table isn't missing any data.

In practice, it's likely you'll work with datasets which aren't complete. What this means is you'll have to decide whether how to fill the missing data or remove the rows which have data missing.

Let's check out what a version of our car_sales DataFrame might look like with missing values.

import pandas as pd

with pd.ExcelFile("E:\\Input_File.xlsx") as reader:
    Car_Detail_df = pd.read_excel(reader)

print(Car_Detail_df)

Output:

     Make  Color  Odometer (KM)  Doors  Price Date Launched
0  Toyota  White            NaN      4   4000    28/02/2000
1  Toyota    NaN        32549.0      3   7000    28/02/2000
2  Nissan  White            NaN      4   3500    22/09/2005
3   Honda    NaN        45698.0      4   7500    22/09/2005
4  Toyota  White        60000.0      4   6250    28/02/2000

Missing values are shown by NaN in pandas. This can be considered the equivalent of None in Python.

Let's use the .fillna() function to fill the Odometer column with the average of the other values in the same column. Fill Odometer column missing values with mean.

We'll do it with and without inplace.

Car_Detail_df["Odometer (KM)"].fillna(Car_Detail_df["Odometer (KM)"].mean(),
                                     inplace=False) # inplace is set to False by default

Now let's check the original car_sales_missing DataFrame.

print(Car_Detail_df)

Output:

     Make  Color  Odometer (KM)  Doors  Price Date Launched
0  Toyota  White            NaN      4   4000    28/02/2000
1  Toyota    NaN        32549.0      3   7000    28/02/2000
2  Nissan  White            NaN      4   3500    22/09/2005
3   Honda    NaN        45698.0      4   7500    22/09/2005
4  Toyota  White        60000.0      4   6250    28/02/2000

Because inplace is set to False (default), there's still missing values in the "Odometer" column. Let's try setting inplace to True.

Car_Detail_df["Odometer (KM)"].fillna(Car_Detail_df["Odometer (KM)"].mean(),
                                     inplace=True) # inplace is set to True
print(Car_Detail_df)

Output:

     Make  Color  Odometer (KM)  Doors  Price Date Launched
0  Toyota  White   46082.333333      4   4000    28/02/2000
1  Toyota    NaN   32549.000000      3   7000    28/02/2000
2  Nissan  White   46082.333333      4   3500    22/09/2005
3   Honda    NaN   45698.000000      4   7500    22/09/2005
4  Toyota  White   60000.000000      4   6250    28/02/2000

The missing values in the Odometer column have been filled with the mean value of the same column.

In practice, you might not want to fill a column's missing values with the mean, but this example was to show the difference between inplace=False (default) and inplace=True.

Whichever you choose to use will depend on how you structure your code. All you have to remember is inplace=False returns a copy of the DataFrame you're working with. This is helpful if you want to make a duplicate of your current DataFrame and save it to another variable.

We've filled some values but there's still missing values in car_sales_missing. Let's say you wanted to remove any rows which had missing data and only work with rows which had complete coverage.

The following two lines do the same thing

Car_Detail_df.dropna(inplace=True)
Car_Detail_df = Car_Detail_df.dropna()

print(Car_Detail_df)

Now if check again, the rows with missing values are gone and the index numbers have been updated.

Output:

     Make  Color  Odometer (KM)  Doors  Price Date Launched
0  Toyota  White   46082.333333      4   4000    28/02/2000
2  Nissan  White   46082.333333      4   3500    22/09/2005
4  Toyota  White   60000.000000      4   6250    28/02/2000

Instead of removing or filling data, what if you wanted to create it?

For example, creating a column called Seats for number of seats.

pandas allows for simple extra column creation on DataFrame's. Three common ways are adding a Series, Python list or by using existing columns.

seats_column = pd.Series([5, 5, 5], index=[0,2,4])
Car_Detail_df["Seats"] = seats_column

print(Car_Detail_df)

Output:

     Make  Color  Odometer (KM)  Doors  Price Date Launched  Seats
0  Toyota  White   46082.333333      4   4000    28/02/2000      5
2  Nissan  White   46082.333333      4   3500    22/09/2005      5
4  Toyota  White   60000.000000      4   6250    28/02/2000      5

You can also make a column by directly combining the values of other columns. Such as, price per kilometre on the Odometer.

Car_Detail_df["Price per KM"] = Car_Detail_df["Price"] / Car_Detail_df["Odometer (KM)"]

print(Car_Detail_df)

Output:

     Make  Color  Odometer (KM)  ...  Date Launched  Seats Price per KM
0  Toyota  White   46082.333333  ...     28/02/2000      5     0.086801
2  Nissan  White   46082.333333  ...     22/09/2005      5     0.075951
4  Toyota  White   60000.000000  ...     28/02/2000      5     0.104167

[3 rows x 8 columns]

As for column creation, you can also create a new column setting all values to a one standard value.

Car_Detail_df["Number of wheels"] = 4

print(Car_Detail_df)

Output:

     Make  Color  Odometer (KM)  ...  Seats  Price per KM Number of wheels
0  Toyota  White   46082.333333  ...      5      0.086801                4
2  Nissan  White   46082.333333  ...      5      0.075951                4
4  Toyota  White   60000.000000  ...      5      0.104167                4

[3 rows x 9 columns]

Now you've created some columns, you decide to show your colleague what you've done. When they ask about the Price per KM column, you tell them you're not really sure why it's there. You decide you better remove it to prevent confusion.

You can remove a column using .drop('COLUMN_NAME', axis=1). Why axis=1? Because that's the axis columns live on. Rows live on axis=0. Now, drop the Price per KM column from the dataframe.

Car_Detail_df = Car_Detail_df.drop("Price per KM", axis=1)
print(Car_Detail_df)

Output:

Make Color Odometer (KM) ... Date Launched Seats Number of wheels 0 Toyota White 46082.333333 ... 28/02/2000 5 4 2 Nissan White 46082.333333 ... 22/09/2005 5 4 4 Toyota White 60000.000000 ... 28/02/2000 5 4

[3 rows x 8 columns]

Let's say you wanted to shuffle the order of your DataFrame so you could split it into train, validation and test sets. And even though the order of your samples was random, you wanted to make sure.

To do so you could use .sample(frac=1).

.sample() randomly samples different rows from a DataFrame. The frac parameter dictates the fraction, where 1 = 100% of rows, 0.5 = 50% of rows, 0.01 = 1% of rows.

car_sales_sampled = Car_Detail_df.sample(frac=1)
print(car_sales_sampled)

Output:

Make Color Odometer (KM) ... Date Launched Seats Number of wheels 4 Toyota White 60000.000000 ... 28/02/2000 5 4 2 Nissan White 46082.333333 ... 22/09/2005 5 4 0 Toyota White 46082.333333 ... 28/02/2000 5 4

[3 rows x 8 columns]

Notice how the rows remain intact but their order is mixed (check the indexes).

.sample(frac=X) is also helpful when you're working with a large DataFrame. Say you had 2,000,000 rows.

Running tests, analysis and machine learning algorithms on 2,000,000 rows could take a long time. And since being a data scientist or machine learning engineer is about reducing the time between experiments, you might begin with a sample of rows first.

For example, you could use 40k_rows = 2_mil_rows.sample(frac=0.05) to work on 40,000 rows from a DataFrame called 2_mil_rows containing 2,000,000 rows.

What if you wanted to get the indexes back in order?. You could do so using .reset_index().

car_sales_sampled.reset_index(inplace=True)

print(car_sales_sampled)

Output:

   index    Make  Color  ...  Date Launched  Seats  Number of wheels
0      0  Toyota  White  ...     28/02/2000      5                 4
1      2  Nissan  White  ...     22/09/2005      5                 4
2      4  Toyota  White  ...     28/02/2000      5                 4

[3 rows x 9 columns]

Calling .reset_index() on a DataFrame resets the index numbers to their defaults. It also creates a new Index column by default which contains the previous index values.

Finally, what if you wanted to apply a function to a column. Such as, converting the Odometer column from kilometers to miles.

You can do so using the .apply() function and passing it a lambda function. We know there's about 1.6 kilometers in a mile, so if you divide the value in the Odometer column by 1.6, it should convert it to miles.

print(car_sales_sampled["Odometer (KM)"])

car_sales_sampled["Odometer (KM)"] = car_sales_sampled["Odometer (KM)"].apply(lambda x: x / 1.6)

print(car_sales_sampled["Odometer (KM)"])

Output:

print(car_sales_sampled["Odometer (KM)"])

car_sales_sampled["Odometer (KM)"] = car_sales_sampled["Odometer (KM)"].apply(lambda x: x / 1.6)

print(car_sales_sampled["Odometer (KM)"])

Now let's check our final car_sales_sampled DataFrame.

print(car_sales_sampled)

Output:

    index    Make  Color  ...  Date Launched  Seats  Number of wheels
0      0  Toyota  White  ...     28/02/2000      5                 4
1      4  Toyota  White  ...     28/02/2000      5                 4
2      2  Nissan  White  ...     22/09/2005      5                 4

[3 rows x 9 columns]

Describing DataFrame

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Output:

Further reading