Pandas

1. Loading data to pandas

1import pandas as pd
2df = pd.read_csv('pokemon_data.csv')

incase csv file contains no header

1df = pd.read_csv('pokemon_data.csv', header=None)

You can also load data from files other than csv, such as txt, xlsx

1df_xlsx = pd.read_excel('pokemon_data.xlsx')
2df_txt = pd.read_csv('pokemon_data.txt', delimiter='\t') # tab delimiter

2. Reading data in pandas

  • Read headers using df.columns.
  • Read one/multiple columns: df['single_col'], df[['col1', 'col2', 'col3']]
  • Read rows
    • Iterate through all rows:
      1for index, row in df.iterrows():
      2    print(index, row['Name'])
      
    • read a subset of rows: df.iloc[0:4]
    • read rows satified specific conditions:
      1df.loc[df['type'] == 'Grass']
      2# can join multiple conditions using | & ~ (single symbol)
      
    • Read a specific location (cell): df.iloc[r, c]

3. Sorting data

Pandas supports sorting by multiple criteria in ascending/descending order by each criterio

1df.sort_values(['Type 1', 'HP'], ascending=[1, 0])

4. Modifying data

  • Add a new columns based on information from other columns
1df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp. Atk'] + df['Sp. Def'] + df['Speed']

you can use the dot syntax only if the column name is single word

  • drop columns: df = df.drop(columns=['Total'])
  • Build a new dataframe by selecting some columns from old one
    df = df[cols[0:4] + [cols[-1]]+cols[4:12]]
  • Conditional Changes
1df.loc[df['Total'] > 500, ['Generation','Legendary']] = ['Test 1', 'Test 2']

5. Saving data (exporting to files)

Like loading data, you can export into desired format

1df.to_csv('modified.csv', index=False)
2df.to_excel('modified.xlsx', index=False)
3df.to_csv('modified.txt', index=False, sep='\t')

6. Filtering data

1new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison') & (df['HP'] > 70)]
2
3new_df.reset_index(drop=True, inplace=True)

6. Aggregate Statistics/ Groupby

groupby method return DataframeGroupBy object, often follow by aggregating methods like sum, mean, count

1df['count'] = 1  # add a supporting column for precisely counting
2df.groupby(['Type 1', 'Type 2']).count()['count']

7. Working with large amount of data

Pandas supports reading the file by chunk

1new_df = pd.DataFrame(columns=df.columns)
2
3for df in pd.read_csv('modified.csv', chunksize=5):
4    results = df.groupby(['Type 1']).count()
5    
6    new_df = pd.concat([new_df, results])