Welcome to Python!

I know what you are all thinking...finally!

Okay, let's check out the basics of Python.

I am typing this inside a Jupyter notebook, which provides a combined markdown/programming environment similar to R Markdown.

First, here are our standard types:

3

3

type(3)

int

3.0

3.0

type(3.0)

float

type('c')

str

type('ca')

str

type("ca")

str

True

True

type(True)

bool

type(T) #Not defined unlike R

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-20-bb69594bccd4> in <module>()
----> 1 type(T) #Not defined unlike R

NameError: name 'T' is not defined

type(true)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-83ab3fb73e0b> in <module>()
----> 1 type(true)

NameError: name 'true' is not defined

type(x=3) #Fails: inside a call, x=3 is a keyword argument, not an assignment. Assignment is a statement in Python and has no value, unlike in C/C++/R.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-b8133a4db803> in <module>()
----> 1 type(x=3) #Fails: inside a call, x=3 is a keyword argument, not an assignment. Assignment is a statement in Python and has no value, unlike in C/C++/R.

TypeError: type() takes 1 or 3 arguments

x=2 #assignment

x

2

x==3 #Boolean

False

Lists and dictionaries

Okay, let's check out some compound data structures, which are built out of the basic types.

y = [4.5,x, 'c'] #lists can contain different types

type(y)

list

y[0] #zero indexing

4.5

y[1]

2

y[-1] #last entry

'c'

y[-2]

2

len(y)

3

y = y + ['a','b','d']

y

[4.5, 2, 'c', 'a', 'b', 'd']

y[1:3] #Slicing!

[2, 'c']

y[1:4]

[2, 'c', 'a']

y[1:6:2] # jump by twos

[2, 'a', 'd']

y[:] #copy entire list

[4.5, 2, 'c', 'a', 'b', 'd']

z = y

z[1]=3

y

[4.5, 3, 'c', 'a', 'b', 'd']

z = y[:]

z[1]=2

z == y

False

z[1]

2

y[1]

3
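The cells above show the difference between giving a list a second name and copying it. Here is a minimal sketch of the same idea using the is operator, which tests whether two names refer to the same object. The names original, alias and shallow are my own, chosen so that nothing defined above is overwritten.

```python
# Hypothetical names, chosen to avoid clobbering x, y and z from the cells above.
original = [4.5, 2, 'c']
alias = original       # a second name for the same list object
shallow = original[:]  # a new list holding the same elements

alias[1] = 3           # visible through both names
shallow[1] = 99        # affects only the copy

print(alias is original)    # True
print(shallow is original)  # False
print(original)             # [4.5, 3, 'c']
print(shallow)              # [4.5, 99, 'c']
```

Note that the slice makes a shallow copy: the new list is independent, but any mutable elements inside it would still be shared.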

z = y[::-1] #Reverse order

z

['d', 'b', 'a', 'c', 3, 4.5]

Now let us look at dictionaries.

a = {'x' : 1, 'y' : z, 'z' : 'entry'}

a

{'x': 1, 'y': ['d', 'b', 'a', 'c', 3, 4.5], 'z': 'entry'}

a['x']

1

a['y'][3]

'c'

a.values()

dict_values([1, ['d', 'b', 'a', 'c', 3, 4.5], 'entry'])

a.keys()

dict_keys(['x', 'y', 'z'])

'abc'+'efg'

'abcefg'

'abc'[2]

'c'

'abcdef'[-2]='x' # strings are immutable (as usual)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-57-52dd521abc46> in <module>()
----> 1 'abcdef'[-2]='x' # strings are immutable (as usual)

TypeError: 'str' object does not support item assignment

'abc'.upper()

'ABC'

There are also tuples, which are immutable: once created, their entries cannot be changed.

x = (1,2,3)

x

(1, 2, 3)

type(x)

tuple

x[2]

3

x[-2]=3 # Fails

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-63-fe8553fd2868> in <module>()
----> 1 x[-2]=3 # Fails

TypeError: 'tuple' object does not support item assignment
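Since a tuple cannot be modified in place, "changing" one really means building a new tuple. The following is a small sketch of that, plus one practical consequence of immutability: tuples can be used as dictionary keys, while lists cannot. The names point, moved and grid are made up for illustration.

```python
# Hypothetical names for illustration only.
point = (1, 2, 3)

# Build a new tuple with the middle entry replaced; point itself is untouched.
moved = point[:1] + (20,) + point[2:]
print(moved)  # (1, 20, 3)
print(point)  # (1, 2, 3)

# Tuples are hashable, so they can serve as dictionary keys.
grid = {(0, 0): 'origin'}

# A list is mutable and therefore unhashable, so this raises a TypeError.
try:
    grid[[0, 1]] = 'fails'
except TypeError as err:
    print(type(err).__name__)  # TypeError
```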

List and dictionary comprehensions

Okay, now some of my favorite features: list and dictionary comprehensions, which allow us to use syntax similar to the mathematician's set-builder notation.

w = [ a**2 for a in range(10)]

w

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

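Note that exponentiation in Python is done with the symbol **.

Also note that the range function works a bit like slicing: range(start, stop, step) takes the same start, stop and step values as a slice, and likewise excludes stop.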

[a for a in range(1,20,2)]

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

We can also select subsets:

[a for a in range(1,20) if a % 2 != 0]

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
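To make the connection with set-builder notation concrete, here is a short sketch that writes {n^2 : n in {0, ..., 9}, n odd} as a comprehension and as the equivalent explicit loop, and then checks the claim that range lines up with slicing. The variable names are my own.

```python
# {n**2 : n in {0, ..., 9}, n odd} written as a list comprehension.
odd_squares = [n**2 for n in range(10) if n % 2 != 0]

# The same list built with an explicit loop.
odd_squares_loop = []
for n in range(10):
    if n % 2 != 0:
        odd_squares_loop.append(n**2)

print(odd_squares)                      # [1, 9, 25, 49, 81]
print(odd_squares == odd_squares_loop)  # True

# range(start, stop, step) picks the same values as the slice start:stop:step.
print(list(range(1, 20, 2)) == list(range(20))[1:20:2])  # True
```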

An example with functions

Just for fun let's see how to build a simple encrypter. First let us import the string of printable characters from the string module under the name chars.

from string import printable as chars

chars

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

lc = len(chars); lc

100

codebook = {chars[i] : chars[(i+lc//2)%lc] for i in range(lc)}

There are a couple of new things going on in the previous line, so let's unpack it. First we have a dictionary comprehension, which is written like a list comprehension but with curly braces and key : value pairs. We have also used the integer division operator // and the modulus operator %.

codebook['a']

'Y'

codebook['Y']

'a'
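The two lookups above hint at why one function can both encode and decode: shifting by half the alphabet twice brings us back to the start. Here is a sketch that rebuilds the same kind of codebook with an explicit loop and checks that property. The names size, half, shift_codebook and loop_codebook are my own, so the codebook defined above is left alone.

```python
from string import printable

size = len(printable)  # 100 printable characters
half = size // 2       # integer division: 100 // 2 == 50

# Dictionary comprehension: each character maps to the one half an alphabet away.
shift_codebook = {printable[i]: printable[(i + half) % size] for i in range(size)}

# The same dictionary built with an explicit loop.
loop_codebook = {}
for i in range(size):
    loop_codebook[printable[i]] = printable[(i + half) % size]

print(shift_codebook == loop_codebook)  # True
print(7 // 2, 7 % 2)                    # 3 1

# Applying the codebook twice returns every character to itself.
print(all(shift_codebook[shift_codebook[ch]] == ch for ch in printable))  # True
```

This only works because the alphabet has an even number of characters; with an odd length, shifting by size // 2 twice would land one position short.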

Now we are going to use one of the three core functions from functional programming (map, reduce, filter), namely reduce. It goes through a list item by item and applies a two-argument function, passing the accumulated value as the first argument and the current list element as the second. We only need this two-argument function once, so we will use an anonymous (lambda) function.

Finally, it is important to note the absence of braces marking the start and end of the function body. Python marks blocks using indentation instead. This is unusual, but in Python whitespace has meaning, and if you indent inconsistently your program will not run.

from functools import reduce

def encode_decode(s):
    return reduce(lambda x,y: x+codebook[y],s,"")

encrypted = encode_decode('This is a secret message'); encrypted

"5&';I';IYI;#!:#<I+#;;Y%#"

encode_decode(encrypted)

'This is a secret message'
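If reduce feels opaque, it may help to see it next to the loop it replaces. The sketch below uses a made-up helper, running_concat, in place of the lambda and compares three ways of getting the same result; none of these names come from the cells above.

```python
from functools import reduce

def running_concat(accumulated, item):
    """Two-argument function: accumulated value first, current element second."""
    return accumulated + item.upper()

letters = ['a', 'b', 'c']

# reduce starts from "" and folds each element into the accumulated string.
with_reduce = reduce(running_concat, letters, "")

# The equivalent explicit loop.
with_loop = ""
for item in letters:
    with_loop = running_concat(with_loop, item)

# For strings specifically, join is the more common idiom.
with_join = "".join(item.upper() for item in letters)

print(with_reduce, with_loop, with_join)  # ABC ABC ABC
```

For building long strings, join is usually the better choice, since repeated concatenation creates a new string at every step.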

Numpy and Pandas

Unlike R, Python was not designed for statistical analysis. Python was designed as a general purpose high level programming language. However, one of Python's strongest features is a truly vast collection of easy to use libraries (called modules) that drastically simplify our lives.

Two core pieces of R functionality are missing from the base language. We do not have an analogue of vectors (efficient lists containing only one type of element), so we are also lacking matrices and tensors, which are just fancier vectors. We are also lacking the data frame abstraction which plays a central role in R.

Vector functionality comes from numpy, which is usually imported as np. This provides fast vectors and vectorized operations and should be used when possible instead of lists of numerical data. Dataframes come from pandas, which is usually imported as pd. Pandas builds on numpy and is part of the scipy ecosystem, which includes many numerical libraries with more advanced statistics and linear algebra functions. The scipy ecosystem also includes matplotlib, which is a pretty complex/flexible plotting library. I should also mention scikit-learn, a standard (although surprisingly limited) machine learning library built on scipy.

import numpy as np

a=np.arange(10)

np.sin(a) # vectorized operation

array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
       -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849])

A useful numpy feature (although it takes some getting used to) is broadcasting, which is similar to the recycling of vectors in R: when an operation combines arrays of different shapes, numpy automatically stretches the smaller array to match the larger one according to a fixed set of rules. Broadcasting can easily lead to bugs and confusion, so try to be careful.

a*2

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

list(range(10))*2

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

a*a

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
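Here is a small sketch of how broadcasting can surprise you. Multiplying by a scalar keeps the shape, but combining a flat array with a column array silently produces a full two-dimensional table. The names row, column and table are my own.

```python
import numpy as np

row = np.arange(3)                   # shape (3,)
column = np.arange(3).reshape(3, 1)  # shape (3, 1)

# A scalar is stretched to match the array, so the shape is unchanged.
print((row * 2).shape)  # (3,)

# Shapes (3,) and (3, 1) are both stretched to (3, 3) before adding.
table = row + column
print(table.shape)      # (3, 3)
print(table)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]
```

If you expected an elementwise sum of two length-3 vectors here, the (3, 3) result is exactly the kind of bug to watch for; checking .shape early is a cheap safeguard.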

a.a

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-84-30d72b897685> in <module>()
----> 1 a.a

AttributeError: 'numpy.ndarray' object has no attribute 'a'

a.shape

(10,)

b=a.reshape(10,1)

b

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

b.T

array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

b.T.shape

(1, 10)

c=np.dot(a,b); c

array([285])

c.shape

(1,)
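Attribute access such as a.a is not a dot product; np.dot is, and so is the @ operator (available since Python 3.5). The sketch below shows how the shape of the result depends on the shapes of the inputs, which explains the slightly odd (1,) above. The names vec and col are my own.

```python
import numpy as np

vec = np.arange(10)       # shape (10,)
col = vec.reshape(10, 1)  # shape (10, 1)

# Two flat vectors give a plain scalar: the sum of squares 0**2 + ... + 9**2.
print(vec @ vec)            # 285

# A flat vector times a column keeps one leftover axis.
print((vec @ col).shape)    # (1,)

# Row times column is a 1x1 matrix; column times row is a full 10x10 matrix.
print((col.T @ col).shape)  # (1, 1)
print((col @ col.T).shape)  # (10, 10)
```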

d=np.zeros(shape=(2,3)); d

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

e = np.ones_like(d); e

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

f = np.ndarray(shape = (2,3,4), buffer = np.array(list(range(24))),dtype = int) # the alias np.int was removed in NumPy 1.24; use the builtin int

f

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

f[1,2,3]

23

f[1,1:3,3]

array([19, 23])

f[:,1:3,3]

array([[ 7, 11],
       [19, 23]])

for x in f:
    print(x)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]

for outer in f:
    for inner in outer:
        for really_inner in inner:
            print(really_inner)

import pandas as pd

df = pd.read_csv("crypto-markets.csv")

df.head()

df.symbol.unique()

array(['BTC', 'ETH', 'XRP', ..., '9COIN', 'BT1', 'BT2'], dtype=object)

len(df.symbol.unique())

1369

df['symbol'].unique()

array(['BTC', 'ETH', 'XRP', ..., '9COIN', 'BT1', 'BT2'], dtype=object)

small_df = df.head(25)

small_df

small_df[['date', 'close']]

small_df[4:6]

small_df[4] # fails

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 4

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-111-c83d9289599e> in <module>()
----> 1 small_df[4] # fails

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 4

small_df.loc[4]

slug              bitcoin
symbol                BTC
name              Bitcoin
date           2013-05-02
ranknow                 1
open               116.38
high                125.6
low                 92.28
close              105.21
volume                  0
market         1292190000
close_ratio        0.3881
spread              33.32
Name: 4, dtype: object

small_df.loc[4,"open"]

116.38

small_df.iloc[4,4]

1

Pay attention to the syntax for referencing. Think of the loc and iloc objects as dictionaries which pull up the relevant pieces of the data frame and allow slicing notation. The difference is that loc looks rows and columns up by label and iloc looks them up by integer position. Slices with loc are inclusive on both ends, while slices with iloc exclude the end point like ordinary Python slices.
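The difference between labels and positions is easiest to see when the row labels are not 0, 1, 2, ... as they are in small_df. The following is a sketch on a tiny made-up data frame (the name prices, the symbols and the values are all invented for illustration).

```python
import pandas as pd

# Hypothetical data; the index labels 10, 20, 30, 40 deliberately differ from positions.
prices = pd.DataFrame(
    {"symbol": ["AAA", "BBB", "CCC", "DDD"], "close": [1.0, 2.0, 3.0, 4.0]},
    index=[10, 20, 30, 40],
)

# loc uses labels: the row labelled 20 and the column named "close".
print(prices.loc[20, "close"])  # 2.0

# iloc uses positions: second row, second column.
print(prices.iloc[1, 1])        # 2.0

# Label slices include both end points; position slices exclude the end.
print(len(prices.loc[10:30]))   # 3
print(len(prices.iloc[0:2]))    # 2
```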

type(small_df.loc[4:4])

pandas.core.frame.DataFrame

type(small_df.loc[4])

pandas.core.series.Series

df['date'] = pd.to_datetime(df['date'])

df['date'].head()

0   2013-04-28
1   2013-04-29
2   2013-04-30
3   2013-05-01
4   2013-05-02
Name: date, dtype: datetime64[ns]

Select only a few symbols: the second through fifth unique symbols, since the slice [1:5] skips the first one (BTC).

mask = df['symbol'].isin(df['symbol'].unique()[1:5])
trim_df = df[mask]
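The two lines above filter rows with a boolean mask. Here is a sketch of the same pattern on a tiny made-up data frame so that the intermediate values are easy to inspect; the names coins, wanted, keep_rows and subset, as well as the data, are invented for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the much larger crypto data frame.
coins = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB", "CCC", "DDD"],
    "close": [1.0, 1.5, 2.0, 3.0, 4.0],
})

# unique() keeps order of first appearance, so [1:3] picks the 2nd and 3rd symbols.
wanted = coins["symbol"].unique()[1:3]
print(list(wanted))        # ['BBB', 'CCC']

# isin gives one True/False per row; indexing with it keeps the True rows.
keep_rows = coins["symbol"].isin(wanted)
print(keep_rows.tolist())  # [False, False, True, True, False]

subset = coins[keep_rows]
print(subset["symbol"].tolist())  # ['BBB', 'CCC']
```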

from ggplot import *

/Users/justin/anaconda/envs/py36/lib/python3.6/site-packages/ggplot/utils.py:81: FutureWarning: pandas.tslib is deprecated and will be removed in a future version.
You can access Timestamp as pandas.Timestamp
  pd.tslib.Timestamp,
/Users/justin/anaconda/envs/py36/lib/python3.6/site-packages/ggplot/stats/smoothers.py:4: FutureWarning: The pandas.lib module is deprecated and will be removed in a future version. These are private functions and can be accessed from pandas._libs.lib instead
  from pandas.lib import Timestamp
/Users/justin/anaconda/envs/py36/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

gg = ggplot(aes(x='date',y='close',color='symbol'),data = trim_df) + geom_line() + ggtitle("Cryptocurrency prices") + scale_y_log() + \
  scale_x_date() + ylab("Closing price (log-scale)") + xlab("Date")

gg.show()

Output of small_df:

	slug	symbol	name	date	ranknow	open	high	low	close	market	close_ratio	spread
0	bitcoin	BTC	Bitcoin	2013-04-28	1	135.30	135.98	132.10	134.21	1500520000	0.5438	3.88
1	bitcoin	BTC	Bitcoin	2013-04-29	1	134.44	147.49	134.00	144.54	1491160000	0.7813	13.49
2	bitcoin	BTC	Bitcoin	2013-04-30	1	144.00	146.93	134.05	139.00	1597780000	0.3843	12.88
3	bitcoin	BTC	Bitcoin	2013-05-01	1	139.00	139.89	107.72	116.99	1542820000	0.2882	32.17
4	bitcoin	BTC	Bitcoin	2013-05-02	1	116.38	125.60	92.28	105.21	1292190000	0.3881	33.32
5	bitcoin	BTC	Bitcoin	2013-05-03	1	106.25	108.13	79.10	97.75	1180070000	0.6424	29.03
6	bitcoin	BTC	Bitcoin	2013-05-04	1	98.10	115.00	92.50	112.50	1089890000	0.8889	22.50
7	bitcoin	BTC	Bitcoin	2013-05-05	1	112.90	118.80	107.14	115.91	1254760000	0.7521	11.66
8	bitcoin	BTC	Bitcoin	2013-05-06	1	115.98	124.66	106.64	112.30	1289470000	0.3141	18.02
9	bitcoin	BTC	Bitcoin	2013-05-07	1	112.25	113.44	97.70	111.50	1248470000	0.8767	15.74
10	bitcoin	BTC	Bitcoin	2013-05-08	1	109.60	115.78	109.60	113.57	1219450000	0.6424	6.18
11	bitcoin	BTC	Bitcoin	2013-05-09	1	113.20	113.46	109.26	112.67	1259980000	0.8119	4.20
12	bitcoin	BTC	Bitcoin	2013-05-10	1	112.80	122.00	111.55	117.20	1255970000	0.5407	10.45
13	bitcoin	BTC	Bitcoin	2013-05-11	1	117.70	118.68	113.01	115.24	1311050000	0.3933	5.67
14	bitcoin	BTC	Bitcoin	2013-05-12	1	115.64	117.45	113.44	115.00	1288630000	0.3890	4.01
15	bitcoin	BTC	Bitcoin	2013-05-13	1	114.82	118.70	114.50	117.98	1279980000	0.8286	4.20
16	bitcoin	BTC	Bitcoin	2013-05-14	1	117.98	119.80	110.25	111.50	1315720000	0.1309	9.55
17	bitcoin	BTC	Bitcoin	2013-05-15	1	111.40	115.81	103.50	114.22	1242760000	0.8708	12.31
18	bitcoin	BTC	Bitcoin	2013-05-16	1	114.22	118.76	112.20	118.76	1274620000	1.0000	6.56
19	bitcoin	BTC	Bitcoin	2013-05-17	1	118.21	125.30	116.57	123.02	1319590000	0.7388	8.73
20	bitcoin	BTC	Bitcoin	2013-05-18	1	123.50	125.25	122.30	123.50	1379140000	0.4068	2.95
21	bitcoin	BTC	Bitcoin	2013-05-19	1	123.21	124.50	119.57	121.99	1376370000	0.4909	4.93
22	bitcoin	BTC	Bitcoin	2013-05-20	1	122.50	123.62	120.12	122.00	1368910000	0.5371	3.50
23	bitcoin	BTC	Bitcoin	2013-05-21	1	122.02	123.00	121.21	122.88	1363940000	0.9330	1.79
24	bitcoin	BTC	Bitcoin	2013-05-22	1	122.89	124.00	122.00	123.89	1374130000	0.9450	2.00

Output of small_df[['date', 'close']]:

	date	close
0	2013-04-28	134.21
1	2013-04-29	144.54
2	2013-04-30	139.00
3	2013-05-01	116.99
4	2013-05-02	105.21
5	2013-05-03	97.75
6	2013-05-04	112.50
7	2013-05-05	115.91
8	2013-05-06	112.30
9	2013-05-07	111.50
10	2013-05-08	113.57
11	2013-05-09	112.67
12	2013-05-10	117.20
13	2013-05-11	115.24
14	2013-05-12	115.00
15	2013-05-13	117.98
16	2013-05-14	111.50
17	2013-05-15	114.22
18	2013-05-16	118.76
19	2013-05-17	123.02
20	2013-05-18	123.50
21	2013-05-19	121.99
22	2013-05-20	122.00
23	2013-05-21	122.88
24	2013-05-22	123.89