Pandas
1. Loading data to pandas
1import pandas as pd
2df = pd.read_csv('pokemon_data.csv')
incase csv file contains no header
1df = pd.read_csv('pokemon_data.csv', header=None)
You can also load data from files other than csv, such as txt, xlsx
1df_xlsx = pd.read_excel('pokemon_data.xlsx')
2df_txt = pd.read_csv('pokemon_data.txt', delimiter='\t') # tab delimiter
2. Reading data in pandas
- Read headers using
df.columns
. - Read one/multiple columns:
df['single_col']
,df[['col1', 'col2', 'col3']]
- Read rows
- Iterate through all rows:
1for index, row in df.iterrows(): 2 print(index, row['Name'])
- read a subset of rows:
df.iloc[0:4]
- read rows satified specific conditions:
1df.loc[df['type'] == 'Grass'] 2# can join multiple conditions using | & ~ (single symbol)
- Read a specific location (cell):
df.iloc[r, c]
- Iterate through all rows:
3. Sorting data
Pandas supports sorting by multiple criteria in ascending/descending order by each criterio
1df.sort_values(['Type 1', 'HP'], ascending=[1, 0])
4. Modifying data
- Add a new columns based on information from other columns
1df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp. Atk'] + df['Sp. Def'] + df['Speed']
you can use the dot syntax only if the column name is single word
- drop columns:
df = df.drop(columns=['Total'])
- Build a new dataframe by selecting some columns from old one
df = df[cols[0:4] + [cols[-1]]+cols[4:12]]
- Conditional Changes
1df.loc[df['Total'] > 500, ['Generation','Legendary']] = ['Test 1', 'Test 2']
5. Saving data (exporting to files)
Like loading data, you can export into desired format
1df.to_csv('modified.csv', index=False)
2df.to_excel('modified.xlsx', index=False)
3df.to_csv('modified.txt', index=False, sep='\t')
6. Filtering data
1new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison') & (df['HP'] > 70)]
2
3new_df.reset_index(drop=True, inplace=True)
6. Aggregate Statistics/ Groupby
groupby method return DataframeGroupBy object, often follow by aggregating methods like sum, mean, count
1df['count'] = 1 # add a supporting column for precisely counting
2df.groupby(['Type 1', 'Type 2']).count()['count']
7. Working with large amount of data
Pandas supports reading the file by chunk
1new_df = pd.DataFrame(columns=df.columns)
2
3for df in pd.read_csv('modified.csv', chunksize=5):
4 results = df.groupby(['Type 1']).count()
5
6 new_df = pd.concat([new_df, results])