Kaggle 的 Pandas 课程的笔记。
1. Creating, Reading and Writing Panda 基本数据类型 DataFrame DataFrame:表格。
1 2 pd.DataFrame({'Bob' : [50 , 131 ], 'Sue' : ['Ok' , 'Next' ]}) pd.DataFrame({'Bob' : [50 , 131 ], 'Sue' : ['Ok' , 'Next' ]}, index=['PA' , 'PB' ])
Bob
Sue
0
50
131
1
Ok
Next
Bob
Sue
PA
50
131
PB
Ok
Next
Series Series:数据序列。Series 没有像上面“Bob”、“Sue”一样的列名,最多只有一个统称(name
)。
1 pd.Series([1 , 2 , 3 , 4 , 5 ])
1 2 3 4 5 6 0 1 1 2 2 3 3 4 4 5 dtype: int64
1 pd.Series([30 , 35 , 40 ], index=['2015 Sales' , '2016 Sales' , '2017 Sales' ], name='Product A' )
1 2 3 4 2015 Sales 30 2016 Sales 35 2017 Sales 40 Name: Product A, dtype: int64
基本操作 1 2 3 4 5 6 7 wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv" ) wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv" , index_col=0 ) wine_reviews.shape wine_reviews.head() animals.to_csv('cows_and_goats.csv' )
2. Indexing, Selecting & Assigning 基于下标的选择 用 iloc
来进行基于下标的选择。
注意,iloc
和下文的 loc
第一个参数都是选择行(也就是选择 record),第二个参数才选择列(指定想要查询的具体数据类型),与原生 Python 的下表查询不一样。
1 2 3 4 5 reviews.iloc[0 ] reviews.iloc[:, 0 ] reviews.iloc[:3 , 0 ] reviews.iloc[[0 , 2 ], 0 ] df.iloc[0 :1000 ]
基于标签的选择 1 2 3 4 reviews.loc[0 , 'country' ] reviews.loc[:, ['taster_name' , 'taster_twitter_handle' , 'points' ]] reviews.loc[:, 'Apples' :'Potatoes' ] df.loc[0 :1000 ]
注意:loc
列参数选择的区间为闭区间,iloc
列参数选择的区间为左闭右开区间(与 Python 原生的 range()
相同) 。
更改索引 1 reviews.set_index("title" )
条件选择 1 2 3 4 5 reviews.country == 'Italy' reviews.loc[reviews.country == 'Italy' ] reviews.loc[(reviews.country == 'Italy' ) & (reviews.points >= 90 )] reviews.loc[reviews.country.isin(['Italy' , 'France' ])] reviews.loc[reviews.price.isnull()]
赋值 1 2 reviews['critic' ] = 'everyone' reviews['index_backwards' ] = range (len (reviews), 0 , -1 )
3. Summary Functions and Maps 获取某一列的统计信息 1 2 3 4 reviews.points.describe() reviews.points.mean() reviews.taster_name.unique() reviews.taster_name.value_counts()
映射 1 2 3 4 5 6 7 8 9 10 reviews.points.map (lambda x: x * 100 ) reviews.points * 100 reviews.country + " - " + reviews.region_1 def remean_points (row ): row.points = row.points * 100 return row reviews.apply(remean_points, axis='columns' ) reviews.apply(remean_points, axis='index' )
若 apply()
提供的函数返回为单值,则 apply
返回的将会是 Series 而非 DataFrame。
注意:这两个函数都不会修改原表的内容。
4. Grouping and Sorting 分组 1 2 3 4 5 6 7 8 reviews.groupby('points' ).points.count() reviews.groupby('points' ).price.min () reviews.groupby('winery' ).apply(lambda df: df.title.iloc[0 ]) reviews.groupby(['country' , 'province' ]).apply(lambda df: df.loc[df.points.idxmax()]) reviews.groupby(['country' ]).price.agg([len , min , max ]) countries_reviewed = reviews.groupby(['country' , 'province' ]).description.agg([len ]) countries_reviewed.reset_index()
排序 1 2 3 4 5 6 7 dataframe.sort_values(by='len' ) dataframe.sort_values(by='len' , ascending=False ) dataframe.sort_values(by=['country' , 'len' ]) dataframe.sort_index() series.sort_values() dataframe.sort_index()
同样地,排序不会修改原表的内容。
5. Data Types and Missing Values Dtypes 1 2 3 4 reviews.price.dtype reviews.dtypes reviews.points.astype('float64' ) reviews.index.dtype
丢失数据(NaN) 1 2 reviews.region_2.fillna("Unknown" ) reviews.taster_twitter_handle.replace("@kerinokeefe" , "@kerino" )
6. Renaming and Combining 重命名 1 2 3 reviews.rename(columns={'points' : 'score' }) reviews.rename(index={0 : 'firstEntry' , 1 : 'secondEntry' }) reviews.rename_axis("wines" , axis='rows' ).rename_axis("fields" , axis='columns' )
结合 三种方式(从简单到复杂):concat(), join(), merge()
1 2 3 4 5 6 7 8 9 10 11 12 canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv" ) british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv" ) pd.concat([canadian_youtube, british_youtube]) left = canadian_youtube.set_index(['title' , 'trending_date' ]) right = british_youtube.set_index(['title' , 'trending_date' ]) left.join(right, lsuffix='_CAN' , rsuffix='_UK' )
删除 1 2 3 df.drop('age' , axis='columns' ) df.drop(2 , axis='index' ) df.age.dropna()
One-hot 编码
color
class
0
green
A
1
red
B
2
blue
C
1 pd.get_dummies(df.color, prefix='hot' )
hot_blue
hot_green
hot_red
0
0
1
0
1
0
0
1
2
1
0
0
其他统计 统计 a
、b
两列中有多少个数大于 $0$:
1 X['count' ] = (X[['a' , 'b' ]] > 0 ).sum (axis='columns' )
将 description
列根据 _
进行 split,返回一个多列的 DataFrame:
1 X['description' ].split('_' , n=1 , expand=True )