In [101]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B': ['A', 'B', 'C'] * 4,
   .....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D': np.random.randn(12),
   .....:                    'E': np.random.randn(12)})
   .....:
In [102]: df
Out[102]:
        A  B    C         D         E
0     one  A  foo -1.202872  0.047609
1     one  B  foo -1.814470 -0.136473
2     two  C  foo  1.018601 -0.561757
3   three  A  bar -0.595447 -1.623033
4     one  B  bar  1.395433  0.029399
5     one  C  bar -0.392670 -0.542108
6     two  A  foo  0.007207  0.282696
7   three  B  foo  1.928123 -0.087302
8     one  C  foo -0.055224 -1.575170
9     one  A  bar  2.395985  1.771208
10    two  B  bar  1.552825  0.816482
11  three  C  bar  0.166599  1.100230
We can produce pivot tables from this data very easily:
In [103]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[103]:
C             bar       foo
A     B
one   A  2.395985 -1.202872
      B  1.395433 -1.814470
      C -0.392670 -0.055224
three A -0.595447       NaN
      B       NaN  1.928123
      C  0.166599       NaN
two   A       NaN  0.007207
      B  1.552825       NaN
      C       NaN  1.018601
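pivot_table aggregates with the mean by default; the aggfunc argument swaps in any other reduction. A minimal sketch (the counts name is ours, not from the original) that counts entries per cell instead of averaging them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})

# aggfunc defaults to 'mean'; 'count' tallies how many rows fall in each cell
counts = pd.pivot_table(df, values='D', index=['A', 'B'],
                        columns=['C'], aggfunc='count')
```

Passing margins=True would additionally append an All row and column with the grand totals.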
Time series
pandas has simple, powerful, and efficient functionality for
performing resampling operations during frequency conversion (e.g., converting
secondly data into 5-minutely data). This is extremely common in, but not
limited to, financial applications. See the Time
Series section.
In [104]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [105]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [106]: ts.resample('5Min').sum()
Out[106]:
2012-01-01    24182
Freq: 5T, dtype: int64
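Downsampling is not limited to sum(): any reduction can close out each bin. A short sketch, assuming the same secondly series as above, that averages into one-minute bins instead:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

# 100 seconds from midnight fall into two one-minute bins (60 + 40 points)
per_minute = ts.resample('1min').mean()
```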
Time zone representation:
In [107]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [108]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [109]: ts
Out[109]:
2012-03-06 1.857704
2012-03-07 -1.193545
2012-03-08 0.677510
2012-03-09 -0.153931
2012-03-10 0.520091
Freq: D, dtype: float64
In [110]: ts_utc = ts.tz_localize('UTC')
In [111]: ts_utc
Out[111]:
2012-03-06 00:00:00+00:00    1.857704
2012-03-07 00:00:00+00:00   -1.193545
2012-03-08 00:00:00+00:00    0.677510
2012-03-09 00:00:00+00:00   -0.153931
2012-03-10 00:00:00+00:00    0.520091
Freq: D, dtype: float64
Converting to another time zone:
In [112]: ts_utc.tz_convert('US/Eastern')
Out[112]:
2012-03-05 19:00:00-05:00    1.857704
2012-03-06 19:00:00-05:00   -1.193545
2012-03-07 19:00:00-05:00    0.677510
2012-03-08 19:00:00-05:00   -0.153931
2012-03-09 19:00:00-05:00    0.520091
Freq: D, dtype: float64
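It is worth noting that tz_convert only relabels the index: the underlying instants, and therefore the values, are unchanged. A quick sketch verifying this:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)

# Localizing attaches a zone; converting re-expresses the same instants
ts_utc = ts.tz_localize('UTC')
ts_eastern = ts_utc.tz_convert('US/Eastern')
```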
Converting between time span representations:
In [113]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [114]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [115]: ts
Out[115]:
2012-01-31 -1.475051
2012-02-29 0.722570
2012-03-31 -0.322646
2012-04-30 -1.601631
2012-05-31 0.778033
Freq: M, dtype: float64
In [116]: ps = ts.to_period()
In [117]: ps
Out[117]:
2012-01 -1.475051
2012-02 0.722570
2012-03 -0.322646
2012-04 -1.601631
2012-05 0.778033
Freq: M, dtype: float64
In [118]: ps.to_timestamp()
Out[118]:
2012-01-01 -1.475051
2012-02-01 0.722570
2012-03-01 -0.322646
2012-04-01 -1.601631
2012-05-01 0.778033
Freq: MS, dtype: float64
Converting between period and timestamp enables some
convenient arithmetic functions to be used. In the following example, we
convert a quarterly frequency with year ending in November to 9am of the end of
the month following the quarter end:
In [119]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [120]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [121]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [122]: ts.head()
Out[122]:
1990-03-01 09:00   -0.289342
1990-06-01 09:00    0.233141
1990-09-01 09:00   -0.223540
1990-12-01 09:00    0.542054
1991-03-01 09:00   -0.688585
Freq: H, dtype: float64
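The index conversion above leans on Period arithmetic. A minimal sketch of the first step, using the same Q-NOV fiscal calendar (so 1990Q1 spans December 1989 through February 1990); the names month_end and next_month are ours:

```python
import pandas as pd

# Q-NOV means the fiscal year ends in November
p = pd.Period('1990Q1', freq='Q-NOV')

month_end = p.asfreq('M', 'e')   # last month of the quarter
next_month = month_end + 1       # first month after the quarter end
```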
Categoricals
pandas can include categorical data in a DataFrame. For
full docs, see the categorical introduction and the API
documentation.
In [123]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
   .....:                    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
   .....:
Convert the raw grades to a categorical data type.
In [124]: df["grade"] = df["raw_grade"].astype("category")
In [125]: df["grade"]
Out[125]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories to more meaningful names (assigning
to Series.cat.categories is inplace!).
In [126]: df["grade"].cat.categories = ["very good", "good", "very bad"]
Reorder the categories and simultaneously add the missing
categories (methods under Series.cat return a new Series by
default).
In [127]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
   .....:                                                "good", "very good"])
   .....:
In [128]: df["grade"]
Out[128]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sorting is per order in the categories, not lexical order.
In [129]: df.sort_values(by="grade")
Out[129]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
Grouping by a categorical column also shows empty
categories.
In [130]: df.groupby("grade").size()
Out[130]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
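Later pandas releases deprecated the in-place assignment to Series.cat.categories shown above. A hedged sketch of the same pipeline using the methods that return a new Series (rename_categories, set_categories), with an explicit observed=False so empty categories still appear in the groupby:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

grade = df["raw_grade"].astype("category")
# rename positionally: a -> very good, b -> good, e -> very bad
grade = grade.cat.rename_categories(["very good", "good", "very bad"])
# reorder and add the missing categories
grade = grade.cat.set_categories(["very bad", "bad", "medium",
                                  "good", "very good"])
df["grade"] = grade

# observed=False keeps empty categories (bad, medium) in the result
sizes = df.groupby("grade", observed=False).size()
```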
Plotting
See the Plotting docs.
We use the standard convention for referencing the
matplotlib API:
In [131]: import matplotlib.pyplot as plt
In [132]: plt.close('all')
In [133]: ts = pd.Series(np.random.randn(1000),
.....: index=pd.date_range('1/1/2000', periods=1000))
.....:
In [134]: ts = ts.cumsum()
In [135]: ts.plot()
Out[135]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3d511cfa10>
On a DataFrame, the plot() method is a
convenience to plot all of the columns with labels:
In [136]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                   columns=['A', 'B', 'C', 'D'])
   .....:
In [137]: df = df.cumsum()
In [138]: plt.figure()
Out[138]: <Figure size 640x480 with 0 Axes>
In [139]: df.plot()
Out[139]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3d51028650>
In [140]: plt.legend(loc='best')
Out[140]: <matplotlib.legend.Legend at 0x7f3d50ff5250>
Getting data in/out
CSV
Writing to a csv file.
In [141]: df.to_csv('foo.csv')
Reading from a csv file.
In [142]: pd.read_csv('foo.csv')
Out[142]:
     Unnamed: 0          A          B          C          D
0    2000-01-01   0.350262   0.843315   1.798556   0.782234
1    2000-01-02  -0.586873   0.034907   1.923792  -0.562651
2    2000-01-03  -1.245477  -0.963406   2.269575  -1.612566
3    2000-01-04  -0.252830  -0.498066   3.176886  -1.275581
4    2000-01-05  -1.044057   0.118042   2.768571   0.386039
..          ...        ...        ...        ...        ...
995  2002-09-22 -48.017654  31.474551  69.146374 -47.541670
996  2002-09-23 -47.207912  32.627390  68.505254 -48.828331
997  2002-09-24 -48.907133  31.990402  67.310924 -49.391051
998  2002-09-25 -50.146062  33.716770  67.717434 -49.037577
999  2002-09-26 -49.724318  33.479952  68.108014 -48.822030
[1000 rows x 5 columns]
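Note that the round trip above leaves the old index in an 'Unnamed: 0' column; index_col and parse_dates recover it as a DatetimeIndex. A sketch using a smaller frame and a hypothetical foo_roundtrip.csv:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4),
                  index=pd.date_range('1/1/2000', periods=5),
                  columns=['A', 'B', 'C', 'D'])
df.to_csv('foo_roundtrip.csv')

# Rebuild the DatetimeIndex instead of getting an 'Unnamed: 0' column
df2 = pd.read_csv('foo_roundtrip.csv', index_col=0, parse_dates=True)
```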
HDF5
Reading and writing to HDFStores.
Writing to a HDF5 Store.
In [143]: df.to_hdf('foo.h5', 'df')
Reading from a HDF5 Store.
In [144]: pd.read_hdf('foo.h5', 'df')
Out[144]:
                    A          B          C          D
2000-01-01   0.350262   0.843315   1.798556   0.782234
2000-01-02  -0.586873   0.034907   1.923792  -0.562651
2000-01-03  -1.245477  -0.963406   2.269575  -1.612566
2000-01-04  -0.252830  -0.498066   3.176886  -1.275581
2000-01-05  -1.044057   0.118042   2.768571   0.386039
...               ...        ...        ...        ...
2002-09-22 -48.017654  31.474551  69.146374 -47.541670
2002-09-23 -47.207912  32.627390  68.505254 -48.828331
2002-09-24 -48.907133  31.990402  67.310924 -49.391051
2002-09-25 -50.146062  33.716770  67.717434 -49.037577
2002-09-26 -49.724318  33.479952  68.108014 -48.822030
[1000 rows x 4 columns]
Excel
Reading and writing to MS
Excel.
Writing to an excel file.
In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
Reading from an excel file.
In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[146]:
     Unnamed: 0          A          B          C          D
0    2000-01-01   0.350262   0.843315   1.798556   0.782234
1    2000-01-02  -0.586873   0.034907   1.923792  -0.562651
2    2000-01-03  -1.245477  -0.963406   2.269575  -1.612566
3    2000-01-04  -0.252830  -0.498066   3.176886  -1.275581
4    2000-01-05  -1.044057   0.118042   2.768571   0.386039
..          ...        ...        ...        ...        ...
995  2002-09-22 -48.017654  31.474551  69.146374 -47.541670
996  2002-09-23 -47.207912  32.627390  68.505254 -48.828331
997  2002-09-24 -48.907133  31.990402  67.310924 -49.391051
998  2002-09-25 -50.146062  33.716770  67.717434 -49.037577
999  2002-09-26 -49.724318  33.479952  68.108014 -48.822030
[1000 rows x 5 columns]
Gotchas
If you are attempting to use a Series in a boolean context you might see
an exception like:
>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
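The fix is to state explicitly which reduction you mean rather than relying on truthiness. A short sketch:

```python
import pandas as pd

s = pd.Series([False, True, False])

# Instead of `if s:` state the reduction you mean:
any_true = s.any()   # True if at least one element is True
all_true = s.all()   # True only if every element is True
is_empty = s.empty   # True when the Series has no elements
```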