Pandas for Data Cleaning:
You can get the Project code here: GitHub
Check out my Portfolio: Portfolio Website
Project: Data Cleaning
Welcome to my data cleaning project, where I will be working with the FIFA 2021 dataset obtained from Kaggle.
You can find the Data set here: FIFA 2021 Dataset.
In this project, my main focus is on cleaning and preparing the raw data for analysis.
Need for Data Cleaning:
Data cleaning plays a crucial role in ensuring the accuracy and quality of the data we work with. By addressing missing values, inconsistencies, and errors within the dataset, we can trust the results of our analysis and make better and informed decisions.
Aim:
By the end of this project, I will have transformed the raw data into a clean and reliable dataset that can be used for further analysis. I invite you to join me on this journey of data cleaning as we dive into the FIFA 2021 dataset and uncover its hidden insights. So, let’s get started and explore the fascinating world of data cleaning in the context of the FIFA 2021 dataset. Together, we will ensure the data is accurate, consistent, and ready for analysis.
# Reading the csv file using the read_csv method of the pandas library
import pandas as pd
raw_df = pd.read_csv('fifa21_raw_data_v2.csv', low_memory=False)
Creating a copy of the dataset:
It is always good practice to keep a copy of your dataset as a backup in case something goes wrong with the original.
# making a copy of the dataframe
copy_df = raw_df.copy()
# taking a first look at the dataframe
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 77 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 18979 non-null int64
1 Name 18979 non-null object
2 LongName 18979 non-null object
3 photoUrl 18979 non-null object
4 playerUrl 18979 non-null object
5 Nationality 18979 non-null object
6 Age 18979 non-null int64
7 ↓OVA 18979 non-null int64
8 POT 18979 non-null int64
9 Club 18979 non-null object
10 Contract 18979 non-null object
11 Positions 18979 non-null object
12 Height 18979 non-null object
13 Weight 18979 non-null object
14 Preferred Foot 18979 non-null object
15 BOV 18979 non-null int64
16 Best Position 18979 non-null object
17 Joined 18979 non-null object
18 Loan Date End 1013 non-null object
19 Value 18979 non-null object
20 Wage 18979 non-null object
21 Release Clause 18979 non-null object
22 Attacking 18979 non-null int64
23 Crossing 18979 non-null int64
24 Finishing 18979 non-null int64
25 Heading Accuracy 18979 non-null int64
26 Short Passing 18979 non-null int64
27 Volleys 18979 non-null int64
28 Skill 18979 non-null int64
29 Dribbling 18979 non-null int64
30 Curve 18979 non-null int64
31 FK Accuracy 18979 non-null int64
32 Long Passing 18979 non-null int64
33 Ball Control 18979 non-null int64
34 Movement 18979 non-null int64
35 Acceleration 18979 non-null int64
36 Sprint Speed 18979 non-null int64
37 Agility 18979 non-null int64
38 Reactions 18979 non-null int64
39 Balance 18979 non-null int64
40 Power 18979 non-null int64
41 Shot Power 18979 non-null int64
42 Jumping 18979 non-null int64
43 Stamina 18979 non-null int64
44 Strength 18979 non-null int64
45 Long Shots 18979 non-null int64
46 Mentality 18979 non-null int64
47 Aggression 18979 non-null int64
48 Interceptions 18979 non-null int64
49 Positioning 18979 non-null int64
50 Vision 18979 non-null int64
51 Penalties 18979 non-null int64
52 Composure 18979 non-null int64
53 Defending 18979 non-null int64
54 Marking 18979 non-null int64
55 Standing Tackle 18979 non-null int64
56 Sliding Tackle 18979 non-null int64
57 Goalkeeping 18979 non-null int64
58 GK Diving 18979 non-null int64
59 GK Handling 18979 non-null int64
60 GK Kicking 18979 non-null int64
61 GK Positioning 18979 non-null int64
62 GK Reflexes 18979 non-null int64
63 Total Stats 18979 non-null int64
64 Base Stats 18979 non-null int64
65 W/F 18979 non-null object
66 SM 18979 non-null object
67 A/W 18979 non-null object
68 D/W 18979 non-null object
69 IR 18979 non-null object
70 PAC 18979 non-null int64
71 SHO 18979 non-null int64
72 PAS 18979 non-null int64
73 DRI 18979 non-null int64
74 DEF 18979 non-null int64
75 PHY 18979 non-null int64
76 Hits 16384 non-null object
dtypes: int64(54), object(23)
memory usage: 11.1+ MB
Checking for null values:
Checking the number of null or NaN (Not a Number) values in a dataset is an essential practice in data cleaning. It allows us to identify and understand the quality and completeness of our data. By examining the null or NaN values, we can gain insights into missing or incomplete information within the dataset.
By knowing the number of null or NaN values, we can take appropriate actions to handle them. We can choose to remove or impute missing values, based on the context and nature of the data. This helps in maintaining the integrity and consistency of the dataset.
raw_df.isnull().sum().head(46)
ID 0
Name 0
LongName 0
photoUrl 0
playerUrl 0
Nationality 0
Age 0
↓OVA 0
POT 0
Club 0
Contract 0
Positions 0
Height 0
Weight 0
Preferred Foot 0
BOV 0
Best Position 0
Joined 0
Loan Date End 17966
Value 0
Wage 0
Release Clause 0
Attacking 0
Crossing 0
Finishing 0
Heading Accuracy 0
Short Passing 0
Volleys 0
Skill 0
Dribbling 0
Curve 0
FK Accuracy 0
Long Passing 0
Ball Control 0
Movement 0
Acceleration 0
Sprint Speed 0
Agility 0
Reactions 0
Balance 0
Power 0
Shot Power 0
Jumping 0
Stamina 0
Strength 0
Long Shots 0
dtype: int64
raw_df.isnull().sum().tail(30)
Aggression 0
Interceptions 0
Positioning 0
Vision 0
Penalties 0
Composure 0
Defending 0
Marking 0
Standing Tackle 0
Sliding Tackle 0
Goalkeeping 0
GK Diving 0
GK Handling 0
GK Kicking 0
GK Positioning 0
GK Reflexes 0
Total Stats 0
Base Stats 0
W/F 0
SM 0
A/W 0
D/W 0
IR 0
PAC 0
SHO 0
PAS 0
DRI 0
DEF 0
PHY 0
Hits 2595
dtype: int64
We can see a huge number of null values in the column named Loan Date End, around 17,000+, but we cannot drop the column on that basis alone, as it still carries necessary information for players on loan.
raw_df['playerUrl'].head()
0 http://sofifa.com/player/158023/lionel-messi/2...
1 http://sofifa.com/player/20801/c-ronaldo-dos-s...
2 http://sofifa.com/player/200389/jan-oblak/210006/
3 http://sofifa.com/player/192985/kevin-de-bruyn...
4 http://sofifa.com/player/190871/neymar-da-silv...
Name: playerUrl, dtype: object
raw_df['photoUrl'].head()
0 https://cdn.sofifa.com/players/158/023/21_60.png
1 https://cdn.sofifa.com/players/020/801/21_60.png
2 https://cdn.sofifa.com/players/200/389/21_60.png
3 https://cdn.sofifa.com/players/192/985/21_60.png
4 https://cdn.sofifa.com/players/190/871/21_60.png
Name: photoUrl, dtype: object
By looking at this data we can say these columns are of no use in this project, as we cannot analyze URL links, so we can drop them with no loss of information.
raw_df = raw_df.drop(columns=['photoUrl', 'playerUrl'])
Removing special characters:
Removing special characters and extra spaces from a dataset is a crucial preprocessing step. It helps to ensure data cleanliness and consistency, and improves the quality of subsequent analysis. Special characters, such as punctuation marks or symbols, as well as extra spaces, are often irrelevant to the analysis or modeling process; they introduce noise and unwanted variation. By removing them, we can focus on the essential information in the dataset.
Looking over the data, we found that in some columns . and , carry meaning, so we will keep those, but sequences like … are of no use, so we will remove them.
We will use regular expression patterns with the pandas replace() method (which supports regex natively, so we do not even need to import the re module) to match the special characters and replace them with nothing.
# replacing special characters except . and ,
raw_df = raw_df.replace(r'[^a-zA-Z0-9.,]', '', regex=True)
# replacing extra spaces
raw_df = raw_df.replace(r'\s+', '', regex=True)
# replacing three consecutive periods (...)
raw_df = raw_df.replace(r'\.{3}', '', regex=True)
- r'[^a-zA-Z0-9.,]' matches any character that is not alphanumeric, . or ,; replace() substitutes nothing for each match, removing it.
- r'\s+' matches one or more whitespace characters; replace() removes them (the first pattern already strips spaces, so this pass is mostly a safety net).
- r'\.{3}' matches three consecutive periods; replace() removes them.
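To see how these three patterns interact, here is a small sketch on a few made-up sample strings (they are illustrative only, not rows from the dataset):

```python
import pandas as pd

# made-up sample strings for illustration
s = pd.Series(['Jul 1, 2004', '€103.5M', 'K. De Bruyne ...'])

s = s.replace(r'[^a-zA-Z0-9.,]', '', regex=True)  # drop everything except letters, digits, . and ,
s = s.replace(r'\s+', '', regex=True)             # drop any remaining whitespace
s = s.replace(r'\.{3}', '', regex=True)           # drop runs of three periods

print(s.tolist())  # ['Jul1,2004', '103.5M', 'K.DeBruyne']
```

Note how the first pattern already removes the spaces and the € sign, while the third pass cleans up the trailing ellipsis.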
raw_df[['Hits','Value', 'Wage', 'Release Clause']]
Hits Value Wage Release Clause
0 771 103.5M 560K 138.4M
1 562 63M 220K 75.9M
2 150 120M 125K 159.4M
3 207 129M 370K 161M
4 595 132M 270K 166.5M
... ... ... ... ...
18974 NaN 100K 1K 70K
18975 NaN 130K 500 165K
18976 NaN 120K 500 131K
18977 NaN 100K 2K 88K
18978 NaN 100K 1K 79K
18979 rows × 4 columns
Changing data types & processing data:
By looking at these columns we can understand why we needed to keep the . character: it denotes values like 1.6K and 75.9M.
Now we need to convert these values into 16000 and 75900000 respectively, but to do so we first need to convert the data type to string.
# this will cast the data to NumPy byte strings (the '|S' dtype)
raw_df['Hits'] = raw_df['Hits'].astype('|S')
raw_df['Value'] = raw_df['Value'].astype('|S')
raw_df['Wage'] = raw_df['Wage'].astype('|S')
raw_df['Release Clause'] = raw_df['Release Clause'].astype('|S')
# this will decode the byte strings back to regular Python strings
raw_df['Hits']= raw_df['Hits'].str.decode('utf-8')
raw_df['Value']= raw_df['Value'].str.decode('utf-8')
raw_df['Wage']= raw_df['Wage'].str.decode('utf-8')
raw_df['Release Clause']= raw_df['Release Clause'].str.decode('utf-8')
After this we can iterate over the data and also apply string methods.
Now we will write the logic to strip the suffixes K and M and multiply the values by 1,000 and 1,000,000 respectively.
# Creating a function named convert_to_numeric
def convert_to_numeric(value):
    # case with a K suffix
    if value[-1] == 'K':
        value = value[:-1]            # removing suffix K
        value = float(value) * 1000   # multiplying the value by 1,000
        return int(value)             # returning as integer
    # case with an M suffix
    elif value[-1] == 'M':
        value = value[:-1]               # removing suffix M
        value = float(value) * 1000000   # multiplying the value by 1,000,000
        return int(value)                # returning as integer
    # case with a nan value
    elif value == 'nan':
        return 0                      # returning 0
    # case with no suffix and not nan
    else:
        return int(value)             # returning the same value as integer
# applying the function
raw_df['Hits'] = raw_df['Hits'].apply(convert_to_numeric)
raw_df['Hits'].head(10)
0 771
1 562
2 150
3 207
4 595
5 248
6 246
7 120
8 1600
9 130
Name: Hits, dtype: int64
# applying the function
raw_df['Value'] = raw_df['Value'].apply(convert_to_numeric)
raw_df['Value'].head(10)
0 103500000
1 63000000
2 120000000
3 129000000
4 132000000
5 111000000
6 120500000
7 102000000
8 185500000
9 110000000
Name: Value, dtype: int64
# applying the function
raw_df['Wage'] = raw_df['Wage'].apply(convert_to_numeric)
raw_df['Wage'].head(10)
0 560000
1 220000
2 125000
3 370000
4 270000
5 240000
6 250000
7 160000
8 160000
9 260000
Name: Wage, dtype: int64
# applying the function
raw_df['Release Clause'] = raw_df['Release Clause'].apply(convert_to_numeric)
raw_df['Release Clause'].head(10)
0 138400000
1 75900000
2 159400000
3 161000000
4 166500000
5 132000000
6 144300000
7 120300000
8 203100000
9 147700000
Name: Release Clause, dtype: int64
raw_df[['Hits', 'Value', 'Wage', 'Release Clause']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Hits 18979 non-null int64
1 Value 18979 non-null int64
2 Wage 18979 non-null int64
3 Release Clause 18979 non-null int64
dtypes: int64(4)
memory usage: 593.2 KB
Now we can see that we have successfully changed the format and correctly processed the data. With that done, we will focus on label encoding where required.
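As a side note, the same suffix conversion can be sketched without an explicit Python function, using vectorized string methods instead of apply; this assumes every value is either a plain number, the string 'nan', or ends in K or M:

```python
import pandas as pd

s = pd.Series(['103.5M', '560K', '500', 'nan'])

# multiplier from the last character: K -> 1,000, M -> 1,000,000, anything else -> 1
mult = s.str[-1].map({'K': 1_000, 'M': 1_000_000}).fillna(1)
# strip a trailing K or M and parse; the string 'nan' coerces to NaN, which we fill with 0
num = pd.to_numeric(s.str.rstrip('KM'), errors='coerce').fillna(0)
result = (num * mult).astype(int)

print(result.tolist())  # [103500000, 560000, 500, 0]
```

Vectorized operations like these usually run faster than apply on large frames, though the explicit function above is arguably easier to read.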
Label encoding:
A “label code” typically refers to a numerical representation assigned to a specific label or category within a dataset. It is often used when dealing with categorical variables. For example, in a dataset with a “color” variable containing labels like “red,” “green,” and “blue,” label codes may assign the values 0, 1, and 2 respectively to represent each category.
Label encoding is a common technique used in data preprocessing to convert categorical data into numerical format.
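pandas also offers a built-in shortcut for this via the category dtype; this toy sketch shows the idea (cat.codes assigns codes in alphabetical order of the labels, so an explicit mapping dictionary, as used below, gives more control over which label gets which code):

```python
import pandas as pd

foot = pd.Series(['Right', 'Left', 'Right', 'Right'])

# cat.codes assigns codes in sorted category order: 'Left' -> 0, 'Right' -> 1
codes = foot.astype('category').cat.codes
print(codes.tolist())  # [1, 0, 1, 1]
```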
raw_df['Preferred Foot'].value_counts()
Right 14445
Left 4534
Name: Preferred Foot, dtype: int64
Here we can see the column Preferred Foot has only 2 values, Right & Left, so we can label encode it with 0 & 1 respectively. But before doing that, we first need to convert the data type of the column to string.
# this will cast the data to NumPy byte strings (the '|S' dtype)
raw_df['Preferred Foot'] = raw_df['Preferred Foot'].astype('|S')
# this will decode the byte strings back to regular Python strings
raw_df['Preferred Foot'] = raw_df['Preferred Foot'].str.decode('utf-8')
mapping_preferred_foot = {'Right': 0, 'Left': 1}
raw_df['Preferred Foot'] = raw_df['Preferred Foot'].map(mapping_preferred_foot)
We create a dictionary mapping_preferred_foot which contains the labels for the Preferred Foot column, then use the map function to map the values over the column, replacing each with its respective label.
raw_df['Preferred Foot'].value_counts()
0 14445
1 4534
Name: Preferred Foot, dtype: int64
As we can see, this encoded the labels as 0 for Right and 1 for Left. Also, the value counts are the same as before (14445 Right and 4534 Left).
Now let's look at other columns that have multiple distinct values.
raw_df['Best Position'].value_counts()
CB 3686
ST 2680
CAM 2299
GK 2075
RM 1611
CDM 1445
LB 1086
RB 1079
CM 1047
LM 871
RW 298
RWB 277
LWB 261
LW 186
CF 78
Name: Best Position, dtype: int64
raw_df['A/W'].value_counts()
Medium 12701
High 5288
Low 990
Name: A/W, dtype: int64
raw_df['D/W'].value_counts()
Medium 13956
High 3297
Low 1726
Name: D/W, dtype: int64
Looking at the value counts of the above columns, we can justify the decision to label encode them, since the number of distinct values is very small. On the other hand, the column below cannot reasonably be label encoded, for the simple reason that it has too many distinct values.
raw_df['Nationality'].value_counts()
England 1705
Germany 1195
Spain 1065
France 1003
Argentina 943
...
Malawi 1
Rwanda 1
SoTomPrncipe 1
Aruba 1
Indonesia 1
Name: Nationality, Length: 164, dtype: int64
Converting the columns' data type into string so that we can label encode them.
# this will cast the data to NumPy byte strings (the '|S' dtype)
raw_df['Best Position'] = raw_df['Best Position'].astype('|S')
raw_df['A/W'] = raw_df['A/W'].astype('|S')
raw_df['D/W'] = raw_df['D/W'].astype('|S')
# this will decode the byte strings back to regular Python strings
raw_df['Best Position'] = raw_df['Best Position'].str.decode('utf-8')
raw_df['A/W'] = raw_df['A/W'].str.decode('utf-8')
raw_df['D/W'] = raw_df['D/W'].str.decode('utf-8')
mapping_positions = {
'CB': 0,
'ST': 1,
'CAM': 2,
'GK': 3,
'RM': 4,
'CDM': 5,
'LB': 6,
'RB': 7,
'CM': 8,
'LM': 9,
'RW': 10,
'RWB': 11,
'LWB': 12,
'LW': 13,
'CF': 14
}
raw_df['Best Position'] = raw_df['Best Position'].map(mapping_positions)
Mapping all the distinct values, assigning a label for each, and using the map function, we have replaced all the values with their respective labels.
raw_df['Best Position'].value_counts()
0 3686
1 2680
2 2299
3 2075
4 1611
5 1445
6 1086
7 1079
8 1047
9 871
10 298
11 277
12 261
13 186
14 78
Name: Best Position, dtype: int64
Here we can see that we have converted the values of Best Position into label encoded data. But we also have a column named Positions having more than one position in a single value. Let's take a look at this column.
raw_df['Positions'].head()
0 RW,ST,CF
1 ST,LW
2 GK
3 CAM,CM
4 LW,CAM
Name: Positions, dtype: object
So as you can see, each value can contain several positions, but we want label encoded data, so we will convert it into labels separated by ,. To do so, we first need to convert it into string.
raw_df['Positions']=raw_df['Positions'].astype('|S')
raw_df['Positions']=raw_df['Positions'].str.decode('utf-8')
raw_df['Positions'] = raw_df['Positions'].map(lambda x: ','.join(str(mapping_positions.get(p)) for p in x.split(',')))
Here we used a lambda function that receives each value as x, splits it into a list of positions on the , delimiter, looks each position up in the mapping_positions dictionary we already created, converts each code to a string, and rejoins the codes with , as the separator.
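Since the encoded values are comma-separated code strings, they can be decoded back with the inverse dictionary; a quick sketch (decode_positions is a hypothetical helper for illustration, not part of the project code):

```python
# the same mapping dictionary used for encoding
mapping_positions = {'CB': 0, 'ST': 1, 'CAM': 2, 'GK': 3, 'RM': 4, 'CDM': 5,
                     'LB': 6, 'RB': 7, 'CM': 8, 'LM': 9, 'RW': 10, 'RWB': 11,
                     'LWB': 12, 'LW': 13, 'CF': 14}
# invert it, keyed by the string form of each code
inverse_positions = {str(v): k for k, v in mapping_positions.items()}

def decode_positions(encoded):
    # '10,1,14' -> 'RW,ST,CF'
    return ','.join(inverse_positions[code] for code in encoded.split(','))

print(decode_positions('10,1,14'))  # RW,ST,CF
```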
raw_df['Positions'].head()
0 10,1,14
1 1,13
2 3
3 2,8
4 13,2
Name: Positions, dtype: object
Here you can see the positions have been replaced with their respective labels. Now we will encode the A/W and D/W columns, in the same way as we did with the Best Position column.
mapping_intencity = {'High': 0, 'Medium': 1, 'Low': 2}
raw_df['A/W'] = raw_df['A/W'].map(mapping_intencity)
raw_df['A/W'].value_counts()
1 12701
0 5288
2 990
Name: A/W, dtype: int64
The A/W column is successfully label encoded.
raw_df['D/W'] = raw_df['D/W'].map(mapping_intencity)
raw_df['D/W'].value_counts()
1 13956
0 3297
2 1726
Name: D/W, dtype: int64
So as we can see above, we have successfully label encoded the columns.
Now we will convert some columns' data types into string or integer, depending on their values and use case.
Converting into right Data type:
# changing Name column to string data type
raw_df['Name']=raw_df['Name'].astype('|S')
raw_df['Name']=raw_df['Name'].str.decode('utf-8')
# changing LongName column to string data type
raw_df['LongName']=raw_df['LongName'].astype('|S')
raw_df['LongName']=raw_df['LongName'].str.decode('utf-8')
# changing Nationality column to string data type
raw_df['Nationality']=raw_df['Nationality'].astype('|S')
raw_df['Nationality']=raw_df['Nationality'].str.decode('utf-8')
# changing Club column to string data type
raw_df['Club']=raw_df['Club'].astype('|S')
raw_df['Club']=raw_df['Club'].str.decode('utf-8')
# changing Weight column to string data type
raw_df['Weight']=raw_df['Weight'].astype('|S')
raw_df['Weight']=raw_df['Weight'].str.decode('utf-8')
# changing W/F column to integer data type
raw_df['W/F']=raw_df['W/F'].astype(int)
# changing SM column to integer data type
raw_df['SM']=raw_df['SM'].astype(int)
# changing IR column to integer data type
raw_df['IR']=raw_df['IR'].astype(int)
Let's see what kinds of distinct values we have in the Weight column.
raw_df['Weight'].value_counts()
70kg 1495
75kg 1457
80kg 1108
72kg 1022
78kg 991
...
190lbs 1
130lbs 1
146lbs 1
203lbs 1
157lbs 1
Name: Weight, Length: 79, dtype: int64
As we can see, we have two types of values here, kg and lbs, but we want values in kg only, so we will write a function to convert the values into kg and then into integers.
Processing data to convert pound into kg:
# Creating a function named convert_to_kg
def convert_to_kg(value):
    # case ending in kg
    if value[-1] == 'g':
        value = value[:-2]   # removing suffix kg
        return int(value)    # returning as integer
    # case ending in lbs
    elif value[-1] == 's':
        value = value[:-3]               # removing suffix lbs
        value = float(value) * 0.453592  # converting pounds to kg
        return int(value)                # returning as integer
    # case with a nan value
    elif value == 'nan':
        return 0             # returning 0
# let's just apply the function
raw_df['Weight']=raw_df['Weight'].apply(convert_to_kg)
# now we will rename the column to Weight_in_kg
raw_df = raw_df.rename(columns={'Weight': 'Weight_in_kg'})
After converting all values into kg and then to integers, we have renamed the column to Weight_in_kg.
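For reference, the same kg/lbs conversion could also be sketched with vectorized string methods instead of apply, assuming every value ends in either kg or lbs:

```python
import pandas as pd

w = pd.Series(['70kg', '190lbs', '72kg'])

is_lbs = w.str.endswith('lbs')
num = pd.to_numeric(w.str.replace(r'(kg|lbs)$', '', regex=True))
kg = num.where(~is_lbs, num * 0.453592).astype(int)  # convert only the lbs rows

print(kg.tolist())  # [70, 86, 72]
```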
raw_df['Weight_in_kg'].head()
0 72
1 83
2 87
3 70
4 68
Name: Weight_in_kg, dtype: int64
raw_df['Weight_in_kg'].tail()
18974 66
18975 65
18976 74
18977 69
18978 75
Name: Weight_in_kg, dtype: int64
Looking at this, we can say we successfully processed the data. But in the case of the Height column we can spot a mistake.
raw_df['Height'].tail()
18974 178cm
18975 175cm
18976 179cm
18977 175cm
18978 188cm
Name: Height, dtype: object
copy_df['Height'].tail()
18974 178cm
18975 175cm
18976 179cm
18977 175cm
18978 188cm
Name: Height, dtype: object
Because we removed special characters earlier, we also stripped the ' and " marks from heights given in feet and inches, so to fix this we will drop the column and take it again from the copy_df data frame.
Dropping and using copy_df:
# dropping the column Height
raw_df = raw_df.drop(columns='Height')
# taking the Column Height from copy_df with the name Height_in_cm
raw_df['Height_in_cm'] = copy_df['Height']
We have already renamed the column to Height_in_cm, as we want all the values in cm. Now we will write a function to split the feet and inches, convert them to cm, and return an integer.
But first things first: we need to convert the data type into string.
# changing Height_in_cm column to string data type
raw_df['Height_in_cm']=raw_df['Height_in_cm'].astype('|S')
raw_df['Height_in_cm']=raw_df['Height_in_cm'].str.decode('utf-8')
Processing data to convert feet-inches into cm:
# Creating a function named convert_to_cm
def convert_to_cm(value):
    # case ending in cm
    if value[-1] == 'm':
        value = value[:-2]   # removing suffix cm
        return int(value)    # returning as integer
    # case ending in " (feet and inches)
    elif value[-1] == '"':
        value = value[:-1]                           # removing suffix "
        split_height = value.split("'")
        feet_in_cm = float(split_height[0]) * 30.48  # 1 foot = 30.48 cm
        inch_in_cm = float(split_height[1]) * 2.54   # 1 inch = 2.54 cm
        value = feet_in_cm + inch_in_cm
        return int(value)                            # returning as integer
    # case with a nan value
    elif value == 'nan':
        return 0             # returning 0
# applying the function to convert all the values into cm
raw_df['Height_in_cm']=raw_df['Height_in_cm'].apply(convert_to_cm)
raw_df['Height_in_cm'].tail()
18974 178
18975 175
18976 179
18977 175
18978 188
Name: Height_in_cm, dtype: int64
As we can see, we have processed the data and converted all the values into cm as integers. Now let's look at the column named Contract; just like Height, we made a mistake with this column.
raw_df['Contract'].head()
0 20042021
1 20182022
2 20142023
3 20152023
4 20172022
Name: Contract, dtype: object
copy_df['Contract'].head()
0 2004 ~ 2021
1 2018 ~ 2022
2 2014 ~ 2023
3 2015 ~ 2023
4 2017 ~ 2022
Name: Contract, dtype: object
As we can see, by removing special characters we also removed the ~ separator, introducing a mistake, so to fix this we will drop the column and take it again from the copy data frame.
# dropping the column Contract from raw_df
raw_df = raw_df.drop(columns=['Contract'])
# copying the column Contract from copy_df
raw_df['Contract'] = copy_df['Contract']
raw_df['Contract'].head()
0 2004 ~ 2021
1 2018 ~ 2022
2 2014 ~ 2023
3 2015 ~ 2023
4 2017 ~ 2022
Name: Contract, dtype: object
Now as we have the correct data we will convert it into strings.
# changing Contract column to string data type
raw_df['Contract']=raw_df['Contract'].astype('|S')
raw_df['Contract']=raw_df['Contract'].str.decode('utf-8')
Okay after this lets take a look over the column Joined this tells the joining date of the player.
raw_df['Joined'].head()
0 Jul1,2004
1 Jul10,2018
2 Jul16,2014
3 Aug30,2015
4 Aug3,2017
Name: Joined, dtype: object
As we can see, the data type is not appropriate, so we will convert it to datetime.
Converting data type to datetime:
raw_df['Joined'] = raw_df['Joined'].apply(lambda x: pd.to_datetime(x, format='%b%d,%Y'))
We used a lambda function to convert all the values into the datetime data type with the help of the pandas to_datetime method.
raw_df['Joined'].head()
0 2004-07-01
1 2018-07-10
2 2014-07-16
3 2015-08-30
4 2017-08-03
Name: Joined, dtype: datetime64[ns]
raw_df['Joined'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 18979 entries, 0 to 18978
Series name: Joined
Non-Null Count Dtype
-------------- -----
18979 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 148.4 KB
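As a minor aside, the lambda is not strictly necessary here: pd.to_datetime accepts a whole Series directly, which is typically faster. A sketch on two sample values:

```python
import pandas as pd

joined = pd.Series(['Jul1,2004', 'Aug30,2015'])
parsed = pd.to_datetime(joined, format='%b%d,%Y')  # no .apply / lambda needed

print(parsed.dt.year.tolist())  # [2004, 2015]
```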
But this is not quite what we want; we want to extract the date, month, and year into separate columns.
Extracting date, month & year into new columns:
raw_df['Joined_Date'] = raw_df['Joined'].dt.day
raw_df['Joined_Month'] = raw_df['Joined'].dt.month
raw_df['Joined_Year'] = raw_df['Joined'].dt.year
Here we have created columns named Joined_Date, Joined_Month and Joined_Year, extracting the day, month and year using the dt.day, dt.month and dt.year accessors respectively.
raw_df['Joined_Date'].head()
0 1
1 10
2 16
3 30
4 3
Name: Joined_Date, dtype: int64
raw_df['Joined_Month'].head()
0 7
1 7
2 7
3 8
4 8
Name: Joined_Month, dtype: int64
raw_df['Joined_Year'].head()
0 2004
1 2018
2 2014
3 2015
4 2017
Name: Joined_Year, dtype: int64
Now that we have the date, month and year of joining in separate columns, we no longer need the Joined column, so we will drop it.
raw_df = raw_df.drop(columns=['Joined'])
Also, we change the data type of the Loan Date End column to datetime.
raw_df['Loan Date End'] = raw_df['Loan Date End'].apply(lambda x: pd.to_datetime(x, format='%b%d,%Y'))
Now let's look at raw_df again.
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 77 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 18979 non-null int64
1 Name 18979 non-null object
2 LongName 18979 non-null object
3 Nationality 18979 non-null object
4 Age 18979 non-null int64
5 ↓OVA 18979 non-null int64
6 POT 18979 non-null int64
7 Club 18979 non-null object
8 Positions 18979 non-null object
9 Weight_in_kg 18979 non-null int64
10 Preferred Foot 18979 non-null int64
11 BOV 18979 non-null int64
12 Best Position 18979 non-null int64
13 Loan Date End 1013 non-null datetime64[ns]
14 Value 18979 non-null int64
15 Wage 18979 non-null int64
16 Release Clause 18979 non-null int64
17 Attacking 18979 non-null int64
18 Crossing 18979 non-null int64
19 Finishing 18979 non-null int64
20 Heading Accuracy 18979 non-null int64
21 Short Passing 18979 non-null int64
22 Volleys 18979 non-null int64
23 Skill 18979 non-null int64
24 Dribbling 18979 non-null int64
25 Curve 18979 non-null int64
26 FK Accuracy 18979 non-null int64
27 Long Passing 18979 non-null int64
28 Ball Control 18979 non-null int64
29 Movement 18979 non-null int64
30 Acceleration 18979 non-null int64
31 Sprint Speed 18979 non-null int64
32 Agility 18979 non-null int64
33 Reactions 18979 non-null int64
34 Balance 18979 non-null int64
35 Power 18979 non-null int64
36 Shot Power 18979 non-null int64
37 Jumping 18979 non-null int64
38 Stamina 18979 non-null int64
39 Strength 18979 non-null int64
40 Long Shots 18979 non-null int64
41 Mentality 18979 non-null int64
42 Aggression 18979 non-null int64
43 Interceptions 18979 non-null int64
44 Positioning 18979 non-null int64
45 Vision 18979 non-null int64
46 Penalties 18979 non-null int64
47 Composure 18979 non-null int64
48 Defending 18979 non-null int64
49 Marking 18979 non-null int64
50 Standing Tackle 18979 non-null int64
51 Sliding Tackle 18979 non-null int64
52 Goalkeeping 18979 non-null int64
53 GK Diving 18979 non-null int64
54 GK Handling 18979 non-null int64
55 GK Kicking 18979 non-null int64
56 GK Positioning 18979 non-null int64
57 GK Reflexes 18979 non-null int64
58 Total Stats 18979 non-null int64
59 Base Stats 18979 non-null int64
60 W/F 18979 non-null int32
61 SM 18979 non-null int32
62 A/W 18979 non-null int64
63 D/W 18979 non-null int64
64 IR 18979 non-null int32
65 PAC 18979 non-null int64
66 SHO 18979 non-null int64
67 PAS 18979 non-null int64
68 DRI 18979 non-null int64
69 DEF 18979 non-null int64
70 PHY 18979 non-null int64
71 Hits 18979 non-null int64
72 Height_in_cm 18979 non-null int64
73 Contract 18979 non-null object
74 Joined_Date 18979 non-null int64
75 Joined_Month 18979 non-null int64
76 Joined_Year 18979 non-null int64
dtypes: datetime64[ns](1), int32(3), int64(67), object(6)
memory usage: 10.9+ MB
So we have finally completed the data cleaning for raw_df.
We also used copy_df to recover the columns we messed up, like Height and Contract.
Now let's create the mapping data frames we used during label encoding, for the future, as we might forget the labels, and for anyone who will refer to this data.
Positions_df = pd.DataFrame.from_dict(mapping_positions, orient='index', columns=['Position_Code'])
Positions_df
Position_Code
CB 0
ST 1
CAM 2
GK 3
RM 4
CDM 5
LB 6
RB 7
CM 8
LM 9
RW 10
RWB 11
LWB 12
LW 13
CF 14
Here we have created a data frame named Positions_df. We used the from_dict function to pass the dictionary, with orient='index' so the keys of the dictionary are treated as the index of the data frame, and named the column containing the label codes Position_Code.
Intencity_df = pd.DataFrame.from_dict(mapping_intencity, orient='index', columns=['intencity_Code'])
Intencity_df
intencity_Code
High 0
Medium 1
Low 2
Similarly, we have created another data frame named Intencity_df which contains the labels for the A/W & D/W columns.
Saving the cleaned data into csv files:
# exporting the data as data.csv
raw_df.to_csv('data.csv', index=False)
# exporting the Intencity_df as Intencity_label.csv (keeping the index, as it holds the label names)
Intencity_df.to_csv('Intencity_label.csv')
# exporting the Positions_df as Positions_label.csv (keeping the index, as it holds the label names)
Positions_df.to_csv('Positions_label.csv')
We used the to_csv method to export the data frames as csv files, passing index=False where we do not want the row index in the csv file.
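To double-check that a mapping frame survives the export, one can round-trip it through an in-memory buffer; a sketch using io.StringIO instead of a file on disk (the shortened dictionary is just for illustration). Note that the index must be kept here, since it holds the label names:

```python
import io
import pandas as pd

mapping_positions = {'CB': 0, 'ST': 1, 'CAM': 2}  # shortened for the sketch
Positions_df = pd.DataFrame.from_dict(mapping_positions, orient='index',
                                      columns=['Position_Code'])

buf = io.StringIO()
Positions_df.to_csv(buf)  # the index holds the position names, so we keep it
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)

print(restored.loc['ST', 'Position_Code'])  # 1
```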
Closing the project:
In my data cleaning project, I utilized pandas to clean the Fifa 2021 dataset obtained from Kaggle. To ensure data integrity, I started by making a copy of the dataset, allowing me to revert any changes if needed. Throughout the cleaning process, I encountered several challenges with specific columns but successfully recovered the original data using the copied DataFrame.
One of the key tasks involved transforming data types to their appropriate formats, ensuring consistency and accuracy. I also identified and dropped unnecessary columns, streamlining the dataset for further analysis. In addition, I extracted valuable information from the ‘joined’ column, creating new columns for the date, month, and year of joining.
To enhance data analysis capabilities, I applied label encoding to convert categorical labels into numerical values. Furthermore, I performed data processing operations, such as converting weights from pounds to kilograms and heights from feet and inches to centimeters. I leveraged regular expressions to efficiently remove unwanted characters and ensure data cleanliness.
Overall, this data cleaning project has been instrumental in preparing the Fifa 2021 dataset for advanced analysis and insights. The transformed and refined dataset now sets the stage for more comprehensive exploration and data-driven decision-making.
Acknowledgement:
To all those who have followed this project, I sincerely hope that you have found it informative and engaging. Through this project, I aimed to not only clean and prepare the FIFA 2021 dataset but also inspire you to explore the world of data analysis and visualization further.
I believe that this project has provided valuable insights into the data cleaning process and showcased the power of pandas in handling and transforming datasets. I encourage you to continue this project by conducting your own analysis and visualization. There is so much more to explore within the dataset, and by employing different analytical methods and visualization techniques, you can gain deeper insights and make meaningful discoveries.
Once again, thank you for following this project and being a part of this learning journey.