My Python Queries: Short introduction to Pandas -2

In [101]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,

.....: 'B': ['A', 'B', 'C'] * 4,

.....: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,

.....: 'D': np.random.randn(12),

.....: 'E': np.random.randn(12)})

.....:

In [102]: df

Out[102]:

A B C D E

0 one A foo -1.202872 0.047609

1 one B foo -1.814470 -0.136473

2 two C foo 1.018601 -0.561757

3 three A bar -0.595447 -1.623033

4 one B bar 1.395433 0.029399

5 one C bar -0.392670 -0.542108

6 two A foo 0.007207 0.282696

7 three B foo 1.928123 -0.087302

8 one C foo -0.055224 -1.575170

9 one A bar 2.395985 1.771208

10 two B bar 1.552825 0.816482

11 three C bar 0.166599 1.100230

We can produce pivot tables from this data very easily:

In [103]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Out[103]:

C bar foo

A B

one A 2.395985 -1.202872

B 1.395433 -1.814470

C -0.392670 -0.055224

three A -0.595447 NaN

B NaN 1.928123

C 0.166599 NaN

two A NaN 0.007207

B 1.552825 NaN

C NaN 1.018601

Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

In [104]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [105]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [106]: ts.resample('5Min').sum()

Out[106]:

2012-01-01 24182

Freq: 5T, dtype: int64

Time zone representation:

In [107]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [108]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [109]: ts

Out[109]:

2012-03-06 1.857704

2012-03-07 -1.193545

2012-03-08 0.677510

2012-03-09 -0.153931

2012-03-10 0.520091

Freq: D, dtype: float64

In [110]: ts_utc = ts.tz_localize('UTC')

In [111]: ts_utc

Out[111]:

2012-03-06 00:00:00+00:00 1.857704

2012-03-07 00:00:00+00:00 -1.193545

2012-03-08 00:00:00+00:00 0.677510

2012-03-09 00:00:00+00:00 -0.153931

2012-03-10 00:00:00+00:00 0.520091

Freq: D, dtype: float64

Converting to another time zone:

In [112]: ts_utc.tz_convert('US/Eastern')

Out[112]:

2012-03-05 19:00:00-05:00 1.857704

2012-03-06 19:00:00-05:00 -1.193545

2012-03-07 19:00:00-05:00 0.677510

2012-03-08 19:00:00-05:00 -0.153931

2012-03-09 19:00:00-05:00 0.520091

Freq: D, dtype: float64

Converting between time span representations:

In [113]: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In [114]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [115]: ts

Out[115]:

2012-01-31 -1.475051

2012-02-29 0.722570

2012-03-31 -0.322646

2012-04-30 -1.601631

2012-05-31 0.778033

Freq: M, dtype: float64

In [116]: ps = ts.to_period()

In [117]: ps

Out[117]:

2012-01 -1.475051

2012-02 0.722570

2012-03 -0.322646

2012-04 -1.601631

2012-05 0.778033

Freq: M, dtype: float64

In [118]: ps.to_timestamp()

Out[118]:

2012-01-01 -1.475051

2012-02-01 0.722570

2012-03-01 -0.322646

2012-04-01 -1.601631

2012-05-01 0.778033

Freq: MS, dtype: float64

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

In [119]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [120]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [121]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [122]: ts.head()

Out[122]:

1990-03-01 09:00 -0.289342

1990-06-01 09:00 0.233141

1990-09-01 09:00 -0.223540

1990-12-01 09:00 0.542054

1991-03-01 09:00 -0.688585

Freq: H, dtype: float64

Categoricals

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

In [123]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],

.....: "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

.....:

Convert the raw grades to a categorical data type.

In [124]: df["grade"] = df["raw_grade"].astype("category")

In [125]: df["grade"]

Out[125]:

0 a

1 b

2 b

3 a

4 a

5 e

Name: grade, dtype: category

Categories (3, object): [a, b, e]

Rename the categories to more meaningful names (assigning to Series.cat.categories is inplace!).

In [126]: df["grade"].cat.categories = ["very good", "good", "very bad"]

Reorder the categories and simultaneously add the missing categories (methods under Series .cat return a new Series by default).

In [127]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",

.....: "good", "very good"])

.....:

In [128]: df["grade"]

Out[128]:

0 very good

1 good

2 good

3 very good

4 very good

5 very bad

Name: grade, dtype: category

Categories (5, object): [very bad, bad, medium, good, very good]

Sorting is per order in the categories, not lexical order.

In [129]: df.sort_values(by="grade")

Out[129]:

id raw_grade grade

5 6 e very bad

1 2 b good

2 3 b good

0 1 a very good

3 4 a very good

4 5 a very good

Grouping by a categorical column also shows empty categories.

In [130]: df.groupby("grade").size()

Out[130]:

grade

very bad 1

bad 0

medium 0

good 2

very good 3

dtype: int64

Plotting

See the Plotting docs.

We use the standard convention for referencing the matplotlib API:

In [131]: import matplotlib.pyplot as plt

In [132]: plt.close('all')

In [133]: ts = pd.Series(np.random.randn(1000),

.....: index=pd.date_range('1/1/2000', periods=1000))

.....:

In [134]: ts = ts.cumsum()

In [135]: ts.plot()

Out[135]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3d511cfa10>

On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:

In [136]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,

.....: columns=['A', 'B', 'C', 'D'])

.....:

In [137]: df = df.cumsum()

In [138]: plt.figure()

Out[138]: <Figure size 640x480 with 0 Axes>

In [139]: df.plot()

Out[139]: <matplotlib.axes._subplots.AxesSubplot at 0x7f3d51028650>

In [140]: plt.legend(loc='best')

Out[140]: <matplotlib.legend.Legend at 0x7f3d50ff5250>

Getting data in/out

CSV

Writing to a csv file.

In [141]: df.to_csv('foo.csv')

Reading from a csv file.

In [142]: pd.read_csv('foo.csv')

Out[142]:

Unnamed: 0 A B C D

0 2000-01-01 0.350262 0.843315 1.798556 0.782234

1 2000-01-02 -0.586873 0.034907 1.923792 -0.562651

2 2000-01-03 -1.245477 -0.963406 2.269575 -1.612566

3 2000-01-04 -0.252830 -0.498066 3.176886 -1.275581

4 2000-01-05 -1.044057 0.118042 2.768571 0.386039

.. ... ... ... ... ...

995 2002-09-22 -48.017654 31.474551 69.146374 -47.541670

996 2002-09-23 -47.207912 32.627390 68.505254 -48.828331

997 2002-09-24 -48.907133 31.990402 67.310924 -49.391051

998 2002-09-25 -50.146062 33.716770 67.717434 -49.037577

999 2002-09-26 -49.724318 33.479952 68.108014 -48.822030

[1000 rows x 5 columns]

HDF5

Reading and writing to HDFStores.

Writing to a HDF5 Store.

In [143]: df.to_hdf('foo.h5', 'df')

Reading from a HDF5 Store.

In [144]: pd.read_hdf('foo.h5', 'df')

Out[144]:

A B C D

2000-01-01 0.350262 0.843315 1.798556 0.782234

2000-01-02 -0.586873 0.034907 1.923792 -0.562651

2000-01-03 -1.245477 -0.963406 2.269575 -1.612566

2000-01-04 -0.252830 -0.498066 3.176886 -1.275581

2000-01-05 -1.044057 0.118042 2.768571 0.386039

... ... ... ... ...

2002-09-22 -48.017654 31.474551 69.146374 -47.541670

2002-09-23 -47.207912 32.627390 68.505254 -48.828331

2002-09-24 -48.907133 31.990402 67.310924 -49.391051

2002-09-25 -50.146062 33.716770 67.717434 -49.037577

2002-09-26 -49.724318 33.479952 68.108014 -48.822030

[1000 rows x 4 columns]

Excel

Reading and writing to MS Excel.

Writing to an excel file.

In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')

Reading from an excel file.

In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Out[146]:

Unnamed: 0 A B C D

0 2000-01-01 0.350262 0.843315 1.798556 0.782234

1 2000-01-02 -0.586873 0.034907 1.923792 -0.562651

2 2000-01-03 -1.245477 -0.963406 2.269575 -1.612566

3 2000-01-04 -0.252830 -0.498066 3.176886 -1.275581

4 2000-01-05 -1.044057 0.118042 2.768571 0.386039

.. ... ... ... ... ...

995 2002-09-22 -48.017654 31.474551 69.146374 -47.541670

996 2002-09-23 -47.207912 32.627390 68.505254 -48.828331

997 2002-09-24 -48.907133 31.990402 67.310924 -49.391051

998 2002-09-25 -50.146062 33.716770 67.717434 -49.037577

999 2002-09-26 -49.724318 33.479952 68.108014 -48.822030

[1000 rows x 5 columns]

Gotchas

If you are attempting to perform an operation you might see an exception like:

>>> if pd.Series([False, True, False]):

... print("I was true")

Traceback

...

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

My Python Queries

Friday, April 10, 2020

Short introduction to Pandas -2

No comments:

Post a Comment

Short introduction to Pandas -2

Search This Blog