Pandas for Data Cleaning:
You can get the Project code here: GitHub
Check out my Portfolio: Portfolio Website
Project: Data Cleaning
Welcome to my data cleaning project, where I will be working with the FIFA 2021 dataset obtained from Kaggle.
You can find the Data set here: FIFA 2021 Dataset.
In this project, my main focus is on cleaning and preparing the raw data for analysis.
Need for Data Cleaning:
Data cleaning plays a crucial role in ensuring the accuracy and quality of the data we work with. By addressing missing values, inconsistencies, and errors within the dataset, we can trust the results of our analysis and make better and informed decisions.
Aim:
By the end of this project, I will have transformed the raw data into a clean and reliable dataset that can be used for further analysis. I invite you to join me on this journey of data cleaning as we dive into the FIFA 2021 dataset and uncover its hidden insights. So, let’s get started and explore the fascinating world of data cleaning in the context of the FIFA 2021 dataset. Together, we will ensure the data is accurate, consistent, and ready for analysis.
# Reading the csv file using the read_csv method of the pandas library
import pandas as pd
raw_df = pd.read_csv('fifa21_raw_data_v2.csv', low_memory=False)
Creating a copy of the dataset:
It is always good practice to keep a copy of your dataset as a backup in case something goes wrong with the original.
# making a copy of the dataframe
copy_df = raw_df.copy()
# taking a first look at the dataframe
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 77 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 18979 non-null int64
1 Name 18979 non-null object
2 LongName 18979 non-null object
3 photoUrl 18979 non-null object
4 playerUrl 18979 non-null object
5 Nationality 18979 non-null object
6 Age 18979 non-null int64
7 ↓OVA 18979 non-null int64
8 POT 18979 non-null int64
9 Club 18979 non-null object
10 Contract 18979 non-null object
11 Positions 18979 non-null object
12 Height 18979 non-null object
13 Weight 18979 non-null object
14 Preferred Foot 18979 non-null object
15 BOV 18979 non-null int64
16 Best Position 18979 non-null object
17 Joined 18979 non-null object
18 Loan Date End 1013 non-null object
19 Value 18979 non-null object
20 Wage 18979 non-null object
21 Release Clause 18979 non-null object
22 Attacking 18979 non-null int64
23 Crossing 18979 non-null int64
24 Finishing 18979 non-null int64
25 Heading Accuracy 18979 non-null int64
26 Short Passing 18979 non-null int64
27 Volleys 18979 non-null int64
28 Skill 18979 non-null int64
29 Dribbling 18979 non-null int64
30 Curve 18979 non-null int64
31 FK Accuracy 18979 non-null int64
32 Long Passing 18979 non-null int64
33 Ball Control 18979 non-null int64
34 Movement 18979 non-null int64
35 Acceleration 18979 non-null int64
36 Sprint Speed 18979 non-null int64
37 Agility 18979 non-null int64
38 Reactions 18979 non-null int64
39 Balance 18979 non-null int64
40 Power 18979 non-null int64
41 Shot Power 18979 non-null int64
42 Jumping 18979 non-null int64
43 Stamina 18979 non-null int64
44 Strength 18979 non-null int64
45 Long Shots 18979 non-null int64
46 Mentality 18979 non-null int64
47 Aggression 18979 non-null int64
48 Interceptions 18979 non-null int64
49 Positioning 18979 non-null int64
50 Vision 18979 non-null int64
51 Penalties 18979 non-null int64
52 Composure 18979 non-null int64
53 Defending 18979 non-null int64
54 Marking 18979 non-null int64
55 Standing Tackle 18979 non-null int64
56 Sliding Tackle 18979 non-null int64
57 Goalkeeping 18979 non-null int64
58 GK Diving 18979 non-null int64
59 GK Handling 18979 non-null int64
60 GK Kicking 18979 non-null int64
61 GK Positioning 18979 non-null int64
62 GK Reflexes 18979 non-null int64
63 Total Stats 18979 non-null int64
64 Base Stats 18979 non-null int64
65 W/F 18979 non-null object
66 SM 18979 non-null object
67 A/W 18979 non-null object
68 D/W 18979 non-null object
69 IR 18979 non-null object
70 PAC 18979 non-null int64
71 SHO 18979 non-null int64
72 PAS 18979 non-null int64
73 DRI 18979 non-null int64
74 DEF 18979 non-null int64
75 PHY 18979 non-null int64
76 Hits 16384 non-null object
dtypes: int64(54), object(23)
memory usage: 11.1+ MB
Checking for null values:
Checking the number of null or NaN (Not a Number) values in a dataset is an essential practice in data cleaning. It allows us to identify and understand the quality and completeness of our data. By examining the null or NaN values, we can gain insights into missing or incomplete information within the dataset.
By knowing the number of null or NaN values, we can take appropriate actions to handle them. We can choose to remove or impute missing values, based on the context and nature of the data. This helps in maintaining the integrity and consistency of the dataset.
raw_df.isnull().sum().head(46)
ID 0
Name 0
LongName 0
photoUrl 0
playerUrl 0
Nationality 0
Age 0
↓OVA 0
POT 0
Club 0
Contract 0
Positions 0
Height 0
Weight 0
Preferred Foot 0
BOV 0
Best Position 0
Joined 0
Loan Date End 17966
Value 0
Wage 0
Release Clause 0
Attacking 0
Crossing 0
Finishing 0
Heading Accuracy 0
Short Passing 0
Volleys 0
Skill 0
Dribbling 0
Curve 0
FK Accuracy 0
Long Passing 0
Ball Control 0
Movement 0
Acceleration 0
Sprint Speed 0
Agility 0
Reactions 0
Balance 0
Power 0
Shot Power 0
Jumping 0
Stamina 0
Strength 0
Long Shots 0
dtype: int64
raw_df.isnull().sum().tail(30)
Aggression 0
Interceptions 0
Positioning 0
Vision 0
Penalties 0
Composure 0
Defending 0
Marking 0
Standing Tackle 0
Sliding Tackle 0
Goalkeeping 0
GK Diving 0
GK Handling 0
GK Kicking 0
GK Positioning 0
GK Reflexes 0
Total Stats 0
Base Stats 0
W/F 0
SM 0
A/W 0
D/W 0
IR 0
PAC 0
SHO 0
PAS 0
DRI 0
DEF 0
PHY 0
Hits 2595
dtype: int64
We can see a huge number of null values in the column named Loan Date End, around 17,000+, but we cannot drop the column on that basis alone, as it still carries necessary information for players on loan.
raw_df['playerUrl'].head()
0 http://sofifa.com/player/158023/lionel-messi/2...
1 http://sofifa.com/player/20801/c-ronaldo-dos-s...
2 http://sofifa.com/player/200389/jan-oblak/210006/
3 http://sofifa.com/player/192985/kevin-de-bruyn...
4 http://sofifa.com/player/190871/neymar-da-silv...
Name: playerUrl, dtype: object
raw_df['photoUrl'].head()
0 https://cdn.sofifa.com/players/158/023/21_60.png
1 https://cdn.sofifa.com/players/020/801/21_60.png
2 https://cdn.sofifa.com/players/200/389/21_60.png
3 https://cdn.sofifa.com/players/192/985/21_60.png
4 https://cdn.sofifa.com/players/190/871/21_60.png
Name: photoUrl, dtype: object
By looking at this data we can say these columns are of no use in this project, as we cannot analyze URL links, so we can drop them with no loss of information.
raw_df = raw_df.drop(columns=['photoUrl', 'playerUrl'])
Removing special characters:
Removing special characters and extra spaces from a dataset is a crucial preprocessing step. It helps to ensure data cleanliness and consistency, and improves the quality of subsequent analysis. Special characters, such as punctuation marks or symbols, as well as extra spaces, are often irrelevant to the analysis or modeling process; they introduce noise and unwanted variation. By removing them, we can focus on the essential information in the dataset.
Looking over the data, we found that in some columns . and , carry meaning, so we will keep those, but sequences like … are of no use, so we will remove them.
We will use regular expression patterns with the pandas replace() method (which supports regex natively, so we do not even need to import the re module) to match the special characters and replace them with nothing.
# replacing special characters except . and ,
raw_df = raw_df.replace(r'[^a-zA-Z0-9.,]', '', regex=True)
# replacing extra spaces
raw_df = raw_df.replace(r'\s+', '', regex=True)
# replacing three consecutive periods (...)
raw_df = raw_df.replace(r'\.{3}', '', regex=True)
- r'[^a-zA-Z0-9.,]' matches any character that is not alphanumeric, . or ,; replace() substitutes nothing for each match, removing it.
- r'\s+' matches one or more whitespace characters; replace() removes them (the first pattern already strips spaces, so this pass is mostly a safety net).
- r'\.{3}' matches three consecutive periods; replace() removes them.
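To see how these three patterns interact, here is a small sketch on a few made-up sample strings (they are illustrative only, not rows from the dataset):

```python
import pandas as pd

# made-up sample strings for illustration
s = pd.Series(['Jul 1, 2004', '€103.5M', 'K. De Bruyne ...'])

s = s.replace(r'[^a-zA-Z0-9.,]', '', regex=True)  # drop everything except letters, digits, . and ,
s = s.replace(r'\s+', '', regex=True)             # drop any remaining whitespace
s = s.replace(r'\.{3}', '', regex=True)           # drop runs of three periods

print(s.tolist())  # ['Jul1,2004', '103.5M', 'K.DeBruyne']
```

Note how the first pattern already removes the spaces and the € sign, while the third pass cleans up the trailing ellipsis.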
raw_df[['Hits','Value', 'Wage', 'Release Clause']]
Hits Value Wage Release Clause
0 771 103.5M 560K 138.4M
1 562 63M 220K 75.9M
2 150 120M 125K 159.4M
3 207 129M 370K 161M
4 595 132M 270K 166.5M
... ... ... ... ...
18974 NaN 100K 1K 70K
18975 NaN 130K 500 165K
18976 NaN 120K 500 131K
18977 NaN 100K 2K 88K
18978 NaN 100K 1K 79K
18979 rows × 4 columns
Changing data types & processing data:
By looking at these columns we can understand why we needed to keep the . character: it denotes values like 1.6K and 75.9M.
Now we need to convert these values into 16000 and 75900000 respectively, but to do so we first need to convert the data type to string.
# this will cast the data to NumPy byte strings (the '|S' dtype)
raw_df['Hits'] = raw_df['Hits'].astype('|S')
raw_df['Value'] = raw_df['Value'].astype('|S')
raw_df['Wage'] = raw_df['Wage'].astype('|S')
raw_df['Release Clause'] = raw_df['Release Clause'].astype('|S')
# this will decode the byte strings back to regular Python strings
raw_df['Hits']= raw_df['Hits'].str.decode('utf-8')
raw_df['Value']= raw_df['Value'].str.decode('utf-8')
raw_df['Wage']= raw_df['Wage'].str.decode('utf-8')
raw_df['Release Clause']= raw_df['Release Clause'].str.decode('utf-8')
After this we can iterate over the data and also apply string methods.
Now we will write the logic to strip the suffixes K and M and multiply the values by 1,000 and 1,000,000 respectively.
# Creating a function named convert_to_numeric
def convert_to_numeric(value):
    # case with a K suffix
    if value[-1] == 'K':
        value = value[:-1]            # removing suffix K
        value = float(value) * 1000   # multiplying the value by 1,000
        return int(value)             # returning as integer
    # case with an M suffix
    elif value[-1] == 'M':
        value = value[:-1]               # removing suffix M
        value = float(value) * 1000000   # multiplying the value by 1,000,000
        return int(value)                # returning as integer
    # case with a nan value
    elif value == 'nan':
        return 0                      # returning 0
    # case with no suffix and not nan
    else:
        return int(value)             # returning the same value as integer
# applying the function
raw_df['Hits'] = raw_df['Hits'].apply(convert_to_numeric)
raw_df['Hits'].head(10)
0 771
1 562
2 150
3 207
4 595
5 248
6 246
7 120
8 1600
9 130
Name: Hits, dtype: int64
# applying the function
raw_df['Value'] = raw_df['Value'].apply(convert_to_numeric)
raw_df['Value'].head(10)
0 103500000
1 63000000
2 120000000
3 129000000
4 132000000
5 111000000
6 120500000
7 102000000
8 185500000
9 110000000
Name: Value, dtype: int64
# applying the function
raw_df['Wage'] = raw_df['Wage'].apply(convert_to_numeric)
raw_df['Wage'].head(10)
0 560000
1 220000
2 125000
3 370000
4 270000
5 240000
6 250000
7 160000
8 160000
9 260000
Name: Wage, dtype: int64
# applying the function
raw_df['Release Clause'] = raw_df['Release Clause'].apply(convert_to_numeric)
raw_df['Release Clause'].head(10)
0 138400000
1 75900000
2 159400000
3 161000000
4 166500000
5 132000000
6 144300000
7 120300000
8 203100000
9 147700000
Name: Release Clause, dtype: int64
raw_df[['Hits', 'Value', 'Wage', 'Release Clause']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Hits 18979 non-null int64
1 Value 18979 non-null int64
2 Wage 18979 non-null int64
3 Release Clause 18979 non-null int64
dtypes: int64(4)
memory usage: 593.2 KB
Now we can see that we have successfully changed the format and correctly processed the data. With that done, we will focus on label encoding where required.
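As a side note, the same suffix conversion can be sketched without an explicit Python function, using vectorized string methods instead of apply; this assumes every value is either a plain number, the string 'nan', or ends in K or M:

```python
import pandas as pd

s = pd.Series(['103.5M', '560K', '500', 'nan'])

# multiplier from the last character: K -> 1,000, M -> 1,000,000, anything else -> 1
mult = s.str[-1].map({'K': 1_000, 'M': 1_000_000}).fillna(1)
# strip a trailing K or M and parse; the string 'nan' coerces to NaN, which we fill with 0
num = pd.to_numeric(s.str.rstrip('KM'), errors='coerce').fillna(0)
result = (num * mult).astype(int)

print(result.tolist())  # [103500000, 560000, 500, 0]
```

Vectorized operations like these usually run faster than apply on large frames, though the explicit function above is arguably easier to read.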
Label encoding:
A “label code” typically refers to a numerical representation assigned to a specific label or category within a dataset. It is often used when dealing with categorical variables. For example, in a dataset with a “color” variable containing labels like “red,” “green,” and “blue,” label codes may assign the values 0, 1, and 2 respectively to represent each category.
Label encoding is a common technique used in data preprocessing to convert categorical data into numerical format.
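pandas also offers a built-in shortcut for this via the category dtype; this toy sketch shows the idea (cat.codes assigns codes in alphabetical order of the labels, so an explicit mapping dictionary, as used below, gives more control over which label gets which code):

```python
import pandas as pd

foot = pd.Series(['Right', 'Left', 'Right', 'Right'])

# cat.codes assigns codes in sorted category order: 'Left' -> 0, 'Right' -> 1
codes = foot.astype('category').cat.codes
print(codes.tolist())  # [1, 0, 1, 1]
```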
raw_df['Preferred Foot'].value_counts()
Right 14445
Left 4534
Name: Preferred Foot, dtype: int64
Here we can see the column Preferred Foot has only 2 values, Right & Left, so we can label encode it with 0 & 1 respectively. But before doing that, we first need to convert the data type of the column to string.
# this will cast the data to NumPy byte strings (the '|S' dtype)
raw_df['Preferred Foot'] = raw_df['Preferred Foot'].astype('|S')
# this will decode the byte strings back to regular Python strings
raw_df['Preferred Foot'] = raw_df['Preferred Foot'].str.decode('utf-8')
mapping_preferred_foot = {'Right': 0, 'Left': 1}
raw_df['Preferred Foot'] = raw_df['Preferred Foot'].map(mapping_preferred_foot)
We create a dictionary mapping_preferred_foot which contains the labels for the Preferred Foot column, then use the map function to map the values over the column, replacing each with its respective label.
raw_df['Preferred Foot'].value_counts()
0 14445
1 4534
Name: Preferred Foot, dtype: int64
As we can see, this encoded the labels as 0 for Right and 1 for Left. Also, the value counts are the same as before (14445 Right and 4534 Left).
Now let's look at other columns that have multiple distinct values.
raw_df['Best Position'].value_counts()
CB 3686
ST 2680
CAM 2299
GK 2075
RM 1611
CDM 1445
LB 1086
RB 1079
CM 1047
LM 871
RW 298
RWB 277
LWB 261
LW 186
CF 78
Name: Best Position, dtype: int64
raw_df['A/W'].value_counts()
Medium 12701
High 5288
Low 990
Name: A/W, dtype: int64
raw_df['D/W'].value_counts()
Medium 13956
High 3297
Low 1726
Name: D/W, dtype: int64
Looking at the value counts of the above columns, we can justify the decision to label encode them, since the number of distinct values is very small. On the other hand, the column below cannot reasonably be label encoded, for the simple reason that it has too many distinct values.
raw_df['Nationality'].value_counts()
England 1705
Germany 1195
Spain 1065
France 1003
Argentina 943
...
Malawi 1
Rwanda 1
SoTomPrncipe 1
Aruba 1
Indonesia 1
Name: Nationality, Length: 164, dtype: int64
Converting the columns' data type into string so that we can label encode them.
# this will cast the data to NumPy byte strings (the '|S' dtype)
raw_df['Best Position'] = raw_df['Best Position'].astype('|S')
raw_df['A/W'] = raw_df['A/W'].astype('|S')
raw_df['D/W'] = raw_df['D/W'].astype('|S')
# this will decode the byte strings back to regular Python strings
raw_df['Best Position'] = raw_df['Best Position'].str.decode('utf-8')
raw_df['A/W'] = raw_df['A/W'].str.decode('utf-8')
raw_df['D/W'] = raw_df['D/W'].str.decode('utf-8')
mapping_positions = {
'CB': 0,
'ST': 1,
'CAM': 2,
'GK': 3,
'RM': 4,
'CDM': 5,
'LB': 6,
'RB': 7,
'CM': 8,
'LM': 9,
'RW': 10,
'RWB': 11,
'LWB': 12,
'LW': 13,
'CF': 14
}
raw_df['Best Position'] = raw_df['Best Position'].map(mapping_positions)
Mapping all the distinct values, assigning a label for each, and using the map function, we have replaced all the values with their respective labels.
raw_df['Best Position'].value_counts()
0 3686
1 2680
2 2299
3 2075
4 1611
5 1445
6 1086
7 1079
8 1047
9 871
10 298
11 277
12 261
13 186
14 78
Name: Best Position, dtype: int64
Here we can see that we have converted the values of Best Position into label encoded data. But we also have a column named Positions having more than one position in a single value. Let's take a look at this column.
raw_df['Positions'].head()
0 RW,ST,CF
1 ST,LW
2 GK
3 CAM,CM
4 LW,CAM
Name: Positions, dtype: object
So as you can see, each value can contain several positions, but we want label encoded data, so we will convert it into labels separated by ,. To do so, we first need to convert it into string.
raw_df['Positions']=raw_df['Positions'].astype('|S')
raw_df['Positions']=raw_df['Positions'].str.decode('utf-8')
raw_df['Positions'] = raw_df['Positions'].map(lambda x: ','.join(str(mapping_positions.get(p)) for p in x.split(',')))
Here we used a lambda function that receives each value as x, splits it into a list of positions on the , delimiter, looks each position up in the mapping_positions dictionary we already created, converts each code to a string, and rejoins the codes with , as the separator.
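Since the encoded values are comma-separated code strings, they can be decoded back with the inverse dictionary; a quick sketch (decode_positions is a hypothetical helper for illustration, not part of the project code):

```python
# the same mapping dictionary used for encoding
mapping_positions = {'CB': 0, 'ST': 1, 'CAM': 2, 'GK': 3, 'RM': 4, 'CDM': 5,
                     'LB': 6, 'RB': 7, 'CM': 8, 'LM': 9, 'RW': 10, 'RWB': 11,
                     'LWB': 12, 'LW': 13, 'CF': 14}
# invert it, keyed by the string form of each code
inverse_positions = {str(v): k for k, v in mapping_positions.items()}

def decode_positions(encoded):
    # '10,1,14' -> 'RW,ST,CF'
    return ','.join(inverse_positions[code] for code in encoded.split(','))

print(decode_positions('10,1,14'))  # RW,ST,CF
```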
raw_df['Positions'].head()
0 10,1,14
1 1,13
2 3
3 2,8
4 13,2
Name: Positions, dtype: object
Here you can see the positions have been replaced with their respective labels. Now we will encode the A/W and D/W columns, in the same way as we did with the Best Position column.
mapping_intencity = {'High': 0, 'Medium': 1, 'Low': 2}
raw_df['A/W'] = raw_df['A/W'].map(mapping_intencity)
raw_df['A/W'].value_counts()
1 12701
0 5288
2 990
Name: A/W, dtype: int64
The A/W column is successfully label encoded.
raw_df['D/W'] = raw_df['D/W'].map(mapping_intencity)
raw_df['D/W'].value_counts()
1 13956
0 3297
2 1726
Name: D/W, dtype: int64
So as we can see above, we have successfully label encoded the columns.
Now we will convert some columns' data types into string or integer, depending on their values and use case.
Converting into right Data type:
# changing Name column to string data type
raw_df['Name']=raw_df['Name'].astype('|S')
raw_df['Name']=raw_df['Name'].str.decode('utf-8')
# changing LongName column to string data type
raw_df['LongName']=raw_df['LongName'].astype('|S')
raw_df['LongName']=raw_df['LongName'].str.decode('utf-8')
# changing Nationality column to string data type
raw_df['Nationality']=raw_df['Nationality'].astype('|S')
raw_df['Nationality']=raw_df['Nationality'].str.decode('utf-8')
# changing Club column to string data type
raw_df['Club']=raw_df['Club'].astype('|S')
raw_df['Club']=raw_df['Club'].str.decode('utf-8')
# changing Weight column to string data type
raw_df['Weight']=raw_df['Weight'].astype('|S')
raw_df['Weight']=raw_df['Weight'].str.decode('utf-8')
# changing W/F column to integer data type
raw_df['W/F']=raw_df['W/F'].astype(int)
# changing SM column to integer data type
raw_df['SM']=raw_df['SM'].astype(int)
# changing IR column to integer data type
raw_df['IR']=raw_df['IR'].astype(int)
Let's see what kinds of distinct values we have in the Weight column.
raw_df['Weight'].value_counts()
70kg 1495
75kg 1457
80kg 1108
72kg 1022
78kg 991
...
190lbs 1
130lbs 1
146lbs 1
203lbs 1
157lbs 1
Name: Weight, Length: 79, dtype: int64
As we can see, we have two types of values here, kg and lbs, but we want values in kg only, so we will write a function to convert the values into kg and then into integers.
Processing data to convert pound into kg:
# Creating a function named convert_to_kg
def convert_to_kg(value):
    # case ending in kg
    if value[-1] == 'g':
        value = value[:-2]   # removing suffix kg
        return int(value)    # returning as integer
    # case ending in lbs
    elif value[-1] == 's':
        value = value[:-3]               # removing suffix lbs
        value = float(value) * 0.453592  # converting pounds to kg
        return int(value)                # returning as integer
    # case with a nan value
    elif value == 'nan':
        return 0             # returning 0
# let's just apply the function
raw_df['Weight']=raw_df['Weight'].apply(convert_to_kg)
# now we will rename the column to Weight_in_kg
raw_df = raw_df.rename(columns={'Weight': 'Weight_in_kg'})
After converting all values into kg and then to integers, we have renamed the column to Weight_in_kg.
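For reference, the same kg/lbs conversion could also be sketched with vectorized string methods instead of apply, assuming every value ends in either kg or lbs:

```python
import pandas as pd

w = pd.Series(['70kg', '190lbs', '72kg'])

is_lbs = w.str.endswith('lbs')
num = pd.to_numeric(w.str.replace(r'(kg|lbs)$', '', regex=True))
kg = num.where(~is_lbs, num * 0.453592).astype(int)  # convert only the lbs rows

print(kg.tolist())  # [70, 86, 72]
```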
raw_df['Weight_in_kg'].head()
0 72
1 83
2 87
3 70
4 68
Name: Weight_in_kg, dtype: int64
raw_df['Weight_in_kg'].tail()
18974 66
18975 65
18976 74
18977 69
18978 75
Name: Weight_in_kg, dtype: int64
Looking at this, we can say we successfully processed the data. But in the case of the Height column we can spot a mistake.
raw_df['Height'].tail()
18974 178cm
18975 175cm
18976 179cm
18977 175cm
18978 188cm
Name: Height, dtype: object
copy_df['Height'].tail()
18974 178cm
18975 175cm
18976 179cm
18977 175cm
18978 188cm
Name: Height, dtype: object
Because we removed special characters earlier, we also stripped the ' and " marks from heights given in feet and inches, so to fix this we will drop the column and take it again from the copy_df data frame.
Dropping and using copy_df:
# dropping the column Height
raw_df = raw_df.drop(columns='Height')
# taking the Column Height from copy_df with the name Height_in_cm
raw_df['Height_in_cm'] = copy_df['Height']
We have already renamed the column to Height_in_cm, as we want all the values in cm. Now we will write a function to split the feet and inches, convert them to cm, and return an integer.
But first things first: we need to convert the data type into string.
# changing Height_in_cm column to string data type
raw_df['Height_in_cm']=raw_df['Height_in_cm'].astype('|S')
raw_df['Height_in_cm']=raw_df['Height_in_cm'].str.decode('utf-8')
Processing data to convert feet-inches into cm:
# Creating a function named convert_to_cm
def convert_to_cm(value):
    # case ending in cm
    if value[-1] == 'm':
        value = value[:-2]   # removing suffix cm
        return int(value)    # returning as integer
    # case ending in " (feet and inches)
    elif value[-1] == '"':
        value = value[:-1]                           # removing suffix "
        split_height = value.split("'")
        feet_in_cm = float(split_height[0]) * 30.48  # 1 foot = 30.48 cm
        inch_in_cm = float(split_height[1]) * 2.54   # 1 inch = 2.54 cm
        value = feet_in_cm + inch_in_cm
        return int(value)                            # returning as integer
    # case with a nan value
    elif value == 'nan':
        return 0             # returning 0
# applying the function to convert all the values into cm
raw_df['Height_in_cm']=raw_df['Height_in_cm'].apply(convert_to_cm)
raw_df['Height_in_cm'].tail()
18974 178
18975 175
18976 179
18977 175
18978 188
Name: Height_in_cm, dtype: int64
As we can see, we have processed the data and converted all the values into cm as integers. Now let's look at the column named Contract; just like Height, we made a mistake with this column.
raw_df['Contract'].head()
0 20042021
1 20182022
2 20142023
3 20152023
4 20172022
Name: Contract, dtype: object
copy_df['Contract'].head()
0 2004 ~ 2021
1 2018 ~ 2022
2 2014 ~ 2023
3 2015 ~ 2023
4 2017 ~ 2022
Name: Contract, dtype: object
As we can see, by removing special characters we also removed the ~ separator, introducing a mistake, so to fix this we will drop the column and take it again from the copy data frame.
# dropping the column Contract from raw_df
raw_df = raw_df.drop(columns=['Contract'])
# copying the column Contract from copy_df
raw_df['Contract'] = copy_df['Contract']
raw_df['Contract'].head()
0 2004 ~ 2021
1 2018 ~ 2022
2 2014 ~ 2023
3 2015 ~ 2023
4 2017 ~ 2022
Name: Contract, dtype: object
Now as we have the correct data we will convert it into strings.
# changing Contract column to string data type
raw_df['Contract']=raw_df['Contract'].astype('|S')
raw_df['Contract']=raw_df['Contract'].str.decode('utf-8')
Okay after this lets take a look over the column Joined this tells the joining date of the player.
raw_df['Joined'].head()
0 Jul1,2004
1 Jul10,2018
2 Jul16,2014
3 Aug30,2015
4 Aug3,2017
Name: Joined, dtype: object
As we can see, the data type is not appropriate, so we will convert it to datetime.
Converting data type to datetime:
raw_df['Joined'] = raw_df['Joined'].apply(lambda x: pd.to_datetime(x, format='%b%d,%Y'))
We used a lambda function to convert all the values into the datetime data type with the help of the pandas to_datetime method.
raw_df['Joined'].head()
0 2004-07-01
1 2018-07-10
2 2014-07-16
3 2015-08-30
4 2017-08-03
Name: Joined, dtype: datetime64[ns]
raw_df['Joined'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 18979 entries, 0 to 18978
Series name: Joined
Non-Null Count Dtype
-------------- -----
18979 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 148.4 KB
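As a minor aside, the lambda is not strictly necessary here: pd.to_datetime accepts a whole Series directly, which is typically faster. A sketch on two sample values:

```python
import pandas as pd

joined = pd.Series(['Jul1,2004', 'Aug30,2015'])
parsed = pd.to_datetime(joined, format='%b%d,%Y')  # no .apply / lambda needed

print(parsed.dt.year.tolist())  # [2004, 2015]
```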
But this is not quite what we want; we want to extract the date, month, and year into separate columns.
Extracting date, month & year into new columns:
raw_df['Joined_Date'] = raw_df['Joined'].dt.day
raw_df['Joined_Month'] = raw_df['Joined'].dt.month
raw_df['Joined_Year'] = raw_df['Joined'].dt.year
Here we have created columns named Joined_Date, Joined_Month and Joined_Year, extracting the day, month and year using the dt.day, dt.month and dt.year accessors respectively.
raw_df['Joined_Date'].head()
0 1
1 10
2 16
3 30
4 3
Name: Joined_Date, dtype: int64
raw_df['Joined_Month'].head()
0 7
1 7
2 7
3 8
4 8
Name: Joined_Month, dtype: int64
raw_df['Joined_Year'].head()
0 2004
1 2018
2 2014
3 2015
4 2017
Name: Joined_Year, dtype: int64
Now that we have the date, month and year of joining in separate columns, we no longer need the Joined column, so we will drop it.
raw_df = raw_df.drop(columns=['Joined'])
Also, we change the data type of the Loan Date End column to datetime.
raw_df['Loan Date End'] = raw_df['Loan Date End'].apply(lambda x: pd.to_datetime(x, format='%b%d,%Y'))
Now let's look at raw_df again.
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18979 entries, 0 to 18978
Data columns (total 77 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 18979 non-null int64
1 Name 18979 non-null object
2 LongName 18979 non-null object
3 Nationality 18979 non-null object
4 Age 18979 non-null int64
5 ↓OVA 18979 non-null int64
6 POT 18979 non-null int64
7 Club 18979 non-null object
8 Positions 18979 non-null object
9 Weight_in_kg 18979 non-null int64
10 Preferred Foot 18979 non-null int64
11 BOV 18979 non-null int64
12 Best Position 18979 non-null int64
13 Loan Date End 1013 non-null datetime64[ns]
14 Value 18979 non-null int64
15 Wage 18979 non-null int64
16 Release Clause 18979 non-null int64
17 Attacking 18979 non-null int64
18 Crossing 18979 non-null int64
19 Finishing 18979 non-null int64
20 Heading Accuracy 18979 non-null int64
21 Short Passing 18979 non-null int64
22 Volleys 18979 non-null int64
23 Skill 18979 non-null int64
24 Dribbling 18979 non-null int64
25 Curve 18979 non-null int64
26 FK Accuracy 18979 non-null int64
27 Long Passing 18979 non-null int64
28 Ball Control 18979 non-null int64
29 Movement 18979 non-null int64
30 Acceleration 18979 non-null int64
31 Sprint Speed 18979 non-null int64
32 Agility 18979 non-null int64
33 Reactions 18979 non-null int64
34 Balance 18979 non-null int64
35 Power 18979 non-null int64
36 Shot Power 18979 non-null int64
37 Jumping 18979 non-null int64
38 Stamina 18979 non-null int64
39 Strength 18979 non-null int64
40 Long Shots 18979 non-null int64
41 Mentality 18979 non-null int64
42 Aggression 18979 non-null int64
43 Interceptions 18979 non-null int64
44 Positioning 18979 non-null int64
45 Vision 18979 non-null int64
46 Penalties 18979 non-null int64
47 Composure 18979 non-null int64
48 Defending 18979 non-null int64
49 Marking 18979 non-null int64
50 Standing Tackle 18979 non-null int64
51 Sliding Tackle 18979 non-null int64
52 Goalkeeping 18979 non-null int64
53 GK Diving 18979 non-null int64
54 GK Handling 18979 non-null int64
55 GK Kicking 18979 non-null int64
56 GK Positioning 18979 non-null int64
57 GK Reflexes 18979 non-null int64
58 Total Stats 18979 non-null int64
59 Base Stats 18979 non-null int64
60 W/F 18979 non-null int32
61 SM 18979 non-null int32
62 A/W 18979 non-null int64
63 D/W 18979 non-null int64
64 IR 18979 non-null int32
65 PAC 18979 non-null int64
66 SHO 18979 non-null int64
67 PAS 18979 non-null int64
68 DRI 18979 non-null int64
69 DEF 18979 non-null int64
70 PHY 18979 non-null int64
71 Hits 18979 non-null int64
72 Height_in_cm 18979 non-null int64
73 Contract 18979 non-null object
74 Joined_Date 18979 non-null int64
75 Joined_Month 18979 non-null int64
76 Joined_Year 18979 non-null int64
dtypes: datetime64[ns](1), int32(3), int64(67), object(6)
memory usage: 10.9+ MB
So we have finally completed the data cleaning for raw_df.
We also used copy_df to recover the columns we messed up, like Height and Contract.
Now let's create the mapping data frames we used during label encoding, for the future, as we might forget the labels, and for anyone who will refer to this data.
Positions_df = pd.DataFrame.from_dict(mapping_positions, orient='index', columns=['Position_Code'])
Positions_df
Position_Code
CB 0
ST 1
CAM 2
GK 3
RM 4
CDM 5
LB 6
RB 7
CM 8
LM 9
RW 10
RWB 11
LWB 12
LW 13
CF 14
Here we have created a data frame named Positions_df. We used the from_dict function to pass the dictionary, with orient='index' so the keys of the dictionary are treated as the index of the data frame, and named the column containing the label codes Position_Code.
Intencity_df = pd.DataFrame.from_dict(mapping_intencity, orient='index', columns=['intencity_Code'])
Intencity_df
intencity_Code
High 0
Medium 1
Low 2
Similarly, we have created another data frame named Intencity_df which contains the labels for the A/W & D/W columns.
Saving the cleaned data into csv files:
# exporting the data as data.csv
raw_df.to_csv('data.csv', index=False)
# exporting the Intencity_df as Intencity_label.csv (keeping the index, as it holds the label names)
Intencity_df.to_csv('Intencity_label.csv')
# exporting the Positions_df as Positions_label.csv (keeping the index, as it holds the label names)
Positions_df.to_csv('Positions_label.csv')
We used the to_csv method to export the data frames as csv files, passing index=False where we do not want the row index in the csv file.
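To double-check that a mapping frame survives the export, one can round-trip it through an in-memory buffer; a sketch using io.StringIO instead of a file on disk (the shortened dictionary is just for illustration). Note that the index must be kept here, since it holds the label names:

```python
import io
import pandas as pd

mapping_positions = {'CB': 0, 'ST': 1, 'CAM': 2}  # shortened for the sketch
Positions_df = pd.DataFrame.from_dict(mapping_positions, orient='index',
                                      columns=['Position_Code'])

buf = io.StringIO()
Positions_df.to_csv(buf)  # the index holds the position names, so we keep it
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)

print(restored.loc['ST', 'Position_Code'])  # 1
```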
Closing the project:
In my data cleaning project, I utilized pandas to clean the Fifa 2021 dataset obtained from Kaggle. To ensure data integrity, I started by making a copy of the dataset, allowing me to revert any changes if needed. Throughout the cleaning process, I encountered several challenges with specific columns but successfully recovered the original data using the copied DataFrame.
One of the key tasks involved transforming data types to their appropriate formats, ensuring consistency and accuracy. I also identified and dropped unnecessary columns, streamlining the dataset for further analysis. In addition, I extracted valuable information from the ‘joined’ column, creating new columns for the date, month, and year of joining.
To enhance data analysis capabilities, I applied label encoding to convert categorical labels into numerical values. Furthermore, I performed data processing operations, such as converting weights from pounds to kilograms and heights from feet and inches to centimeters. I leveraged regular expressions to efficiently remove unwanted characters and ensure data cleanliness.
Overall, this data cleaning project has been instrumental in preparing the Fifa 2021 dataset for advanced analysis and insights. The transformed and refined dataset now sets the stage for more comprehensive exploration and data-driven decision-making.
Acknowledgement:
To all those who have followed this project, I sincerely hope that you have found it informative and engaging. Through this project, I aimed to not only clean and prepare the FIFA 2021 dataset but also inspire you to explore the world of data analysis and visualization further.
I believe that this project has provided valuable insights into the data cleaning process and showcased the power of pandas in handling and transforming datasets. I encourage you to continue this project by conducting your own analysis and visualization. There is so much more to explore within the dataset, and by employing different analytical methods and visualization techniques, you can gain deeper insights and make meaningful discoveries.
Once again, thank you for following this project and being a part of this learning journey.