Welcome to Python!

I know what you are all thinking...finally!

Okay, let's check out the basics of Python.

I am typing this inside a Jupyter notebook, which provides a combined markdown/programming environment similar to R Markdown.

First, here are our standard types:

In [11]:
3
Out[11]:
3
In [12]:
type(3)
Out[12]:
int
In [13]:
3.0
Out[13]:
3.0
In [14]:
type(3.0)
Out[14]:
float
In [15]:
type('c')
Out[15]:
str
In [16]:
type('ca')
Out[16]:
str
In [17]:
type("ca")
Out[17]:
str
In [18]:
True
Out[18]:
True
In [19]:
type(True)
Out[19]:
bool
In [20]:
type(T) #Not defined unlike R
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-20-bb69594bccd4> in <module>()
----> 1 type(T) #Not defined unlike R

NameError: name 'T' is not defined
In [21]:
type(true)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-83ab3fb73e0b> in <module>()
----> 1 type(true)

NameError: name 'true' is not defined
In [22]:
type(x=3) #Inside a call, x=3 is a keyword argument, not an assignment. Assignment is a statement and returns no value, unlike in C/C++/R.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-b8133a4db803> in <module>()
----> 1 type(x=3) #Inside a call, x=3 is a keyword argument, not an assignment. Assignment is a statement and returns no value, unlike in C/C++/R.

TypeError: type() takes 1 or 3 arguments
In [23]:
x=2 #assignment
In [24]:
x
Out[24]:
2
In [25]:
x==3 #Boolean
Out[25]:
False
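
As a small aside on the assignment point: since = is a statement, it cannot appear where a value is expected. Here is a minimal sketch, assuming Python 3.8 or later, where the := operator (an assignment expression) does hand back the assigned value. The names n, m and kind are only illustrative.

```python
# Assignment with = is a statement; the := operator (Python 3.8+) is an expression.
n = 3                    # plain assignment, produces no value
kind = type(m := 3)      # := binds m and also yields 3, so type() receives 3

print(kind)              # <class 'int'>
print(m == n)            # True

# isinstance is the usual way to test a type; note that bool is a subclass of int.
print(isinstance(True, int))   # True
```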

Lists and dictionaries

Okay, let's check out some compound data structures.

In [26]:
y = [4.5,x, 'c'] #lists can contain different types
In [27]:
type(y)
Out[27]:
list
In [28]:
y[0] #zero indexing
Out[28]:
4.5
In [29]:
y[1]
Out[29]:
2
In [30]:
y[-1] #last entry
Out[30]:
'c'
In [31]:
y[-2] 
Out[31]:
2
In [32]:
len(y)
Out[32]:
3
In [33]:
y = y + ['a','b','d']
In [34]:
y
Out[34]:
[4.5, 2, 'c', 'a', 'b', 'd']
In [35]:
y[1:3] #Slicing!
Out[35]:
[2, 'c']
In [36]:
y[1:4]
Out[36]:
[2, 'c', 'a']
In [37]:
y[1:6:2] # jump by twos
Out[37]:
[2, 'a', 'd']
In [38]:
y[:] #copy entire list
Out[38]:
[4.5, 2, 'c', 'a', 'b', 'd']
In [39]:
z = y
In [40]:
z[1]=3
In [41]:
y
Out[41]:
[4.5, 3, 'c', 'a', 'b', 'd']
In [42]:
z = y[:]
In [43]:
z[1]=2
In [44]:
z == y
Out[44]:
False
In [45]:
z[1]
Out[45]:
2
In [46]:
y[1]
Out[46]:
3
In [47]:
z = y[::-1] #Reverse order
In [48]:
z
Out[48]:
['d', 'b', 'a', 'c', 3, 4.5]
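
The cells above show that z = y makes a second name for the same list, while z = y[:] makes a copy. A short sketch of the three cases, using the standard copy module; the names original, alias, shallow and deep are made up for this example. Note that a slice copy is shallow, so nested lists are still shared.

```python
import copy

original = [4.5, 3, ['c', 'a']]   # a list holding another (mutable) list
alias = original                  # same object, no copy made
shallow = original[:]             # new outer list, inner list still shared
deep = copy.deepcopy(original)    # fully independent copy

print(alias is original)          # True
print(shallow is original)        # False
print(shallow == original)        # True

original[2].append('b')           # mutate the shared inner list
print(shallow[2])                 # ['c', 'a', 'b']
print(deep[2])                    # ['c', 'a']
```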

Now let us look at dictionaries.

In [49]:
a = {'x' : 1, 'y' : z, 'z' : 'entry'}
In [50]:
a
Out[50]:
{'x': 1, 'y': ['d', 'b', 'a', 'c', 3, 4.5], 'z': 'entry'}
In [51]:
a['x']
Out[51]:
1
In [52]:
a['y'][3]
Out[52]:
'c'
In [53]:
a.values()
Out[53]:
dict_values([1, ['d', 'b', 'a', 'c', 3, 4.5], 'entry'])
In [54]:
a.keys()
Out[54]:
dict_keys(['x', 'y', 'z'])
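
A few more everyday dictionary operations, as a hedged sketch. The dictionary record and its contents are invented for illustration; the methods used (items, get, in) are standard.

```python
record = {'x': 1, 'z': 'entry'}

# Loop over key/value pairs together.
for key, value in record.items():
    print(key, value)

# get returns a default instead of raising KeyError for a missing key.
print(record.get('missing', 0))   # 0
print('x' in record)              # True (membership tests look at keys)

# Assigning to a new key adds an entry; insertion order is kept (Python 3.7+).
record['w'] = 2.5
print(list(record))               # ['x', 'z', 'w']
```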
In [55]:
'abc'+'efg'
Out[55]:
'abcefg'
In [56]:
'abc'[2]
Out[56]:
'c'
In [57]:
'abcdef'[-2]='x' # strings are immutable (as usual)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-57-52dd521abc46> in <module>()
----> 1 'abcdef'[-2]='x' # strings are immutable (as usual)

TypeError: 'str' object does not support item assignment
In [58]:
'abc'.upper()
Out[58]:
'ABC'

There are also tuples, which are immutable (they cannot be changed after creation).

In [59]:
x = (1,2,3)
In [60]:
x
Out[60]:
(1, 2, 3)
In [61]:
type(x)
Out[61]:
tuple
In [62]:
x[2]
Out[62]:
3
In [63]:
x[-2]=3 # Fails
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-63-fe8553fd2868> in <module>()
----> 1 x[-2]=3 # Fails

TypeError: 'tuple' object does not support item assignment
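
Immutability is what makes tuples useful in places where lists are not allowed. A small sketch with made-up names (point, grid, moved): tuples can be unpacked, used as dictionary keys, and "changed" only by building a new tuple.

```python
point = (1, 2, 3)

# Unpacking assigns each entry to its own name.
first, second, third = point
print(first + second + third)     # 6

# Tuples are hashable, so they can be dictionary keys (lists cannot).
grid = {(0, 0): 'origin', (1, 2): 'somewhere'}
print(grid[(1, 2)])               # somewhere

# "Modifying" a tuple means building a new one.
moved = point[:2] + (10,)
print(moved)                      # (1, 2, 10)
print(point)                      # (1, 2, 3), unchanged
```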

List and dictionary comprehensions

Okay, now some of my favorite features: list and dictionary comprehensions, which allow us to use syntax similar to the mathematician's set-builder notation.

In [64]:
w = [ a**2 for a in range(10)]
In [65]:
w
Out[65]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Note that exponentiation in Python is done with the symbol **.

Also note that the range function works a bit like slicing: it takes a start, a stop (which is excluded), and an optional step.

In [66]:
[a for a in range(1,20,2)]
Out[66]:
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

We can also select subsets:

In [67]:
[a for a in range(1,20) if a % 2 != 0]
Out[67]:
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
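
To make the link with set notation concrete, here is a short sketch of a few more comprehension forms. The variable names are illustrative only.

```python
# { n^2 : n in {0,...,9}, n odd } written as a list comprehension
odd_squares = [n**2 for n in range(10) if n % 2 != 0]
print(odd_squares)                # [1, 9, 25, 49, 81]

# Two "for" clauses behave like nested loops (outer loop first).
pairs = [(i, j) for i in range(3) for j in range(i)]
print(pairs)                      # [(1, 0), (2, 0), (2, 1)]

# Curly braces without key : value pairs give an actual set.
remainders = {n % 3 for n in range(10)}
print(remainders == {0, 1, 2})    # True
```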

An example with functions

Just for fun let's see how to build a simple encrypter. First let us import the string of printable characters from the module string and call it chars.

In [68]:
from string import printable as chars
In [69]:
chars
Out[69]:
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
In [70]:
lc = len(chars); lc
Out[70]:
100
In [71]:
codebook = {chars[i] : chars[(i+lc//2)%lc] for i in range(lc)}

There are a couple of new things going on in the previous line, so let's unpack it. First, we have a dictionary comprehension, which is written like a list comprehension but with curly braces and key : value pairs. Second, we have used the integer division operator // and the modulus operator %.
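
A quick sketch of what // and % do, and of why shifting by half the alphabet undoes itself when applied twice. The helper shift and the names size and half are hypothetical, introduced only for this illustration.

```python
from string import printable

size = len(printable)      # 100 printable characters
half = size // 2           # integer (floor) division gives 50

print(7 // 2, 7 % 2)       # 3 1
print(-7 // 2, -7 % 2)     # -4 1 (floor division rounds toward negative infinity)

def shift(i):
    """Move index i half way around the alphabet, wrapping with %."""
    return (i + half) % size

# Shifting twice moves by size, which is 0 modulo size, so we are back at i.
print(all(shift(shift(i)) == i for i in range(size)))   # True
```

This is why the same function can be used both to encode and to decode.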

In [72]:
codebook['a']
Out[72]:
'Y'
In [73]:
codebook['Y']
Out[73]:
'a'

Now we are going to use one of the three core functions from functional programming (map, reduce, filter), namely reduce. It walks through a sequence item by item and applies a two-argument function, passing the accumulated value as the first argument and the current element as the second. We only need this two-argument function once, so we will use an anonymous (lambda) function.

Finally, it is important to note the absence of braces marking the start and end of the function. Python uses indentation for this instead. This is unusual, but in Python indentation has meaning, and if you indent inconsistently your program will not run.

In [74]:
from functools import reduce
In [75]:
def encode_decode(s):
    return reduce(lambda x,y: x+codebook[y],s,"")
In [76]:
encrypted = encode_decode('This is a secret message'); encrypted
Out[76]:
"5&';I';IYI;#!:#<I+#;;Y%#"
In [77]:
encode_decode(encrypted)
Out[77]:
'This is a secret message'
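
To see how reduce threads the accumulated value through the sequence, here is a small sketch that prints every step. The function trace is a made-up helper, and upper-casing stands in for the codebook lookup. For building strings, the join method is the more common idiom, so it is shown for comparison.

```python
from functools import reduce

def trace(acc, ch):
    """Show the accumulated value and the next element, then combine them."""
    print(repr(acc), '+', repr(ch))
    return acc + ch.upper()

result = reduce(trace, 'abc', '')   # '' is the starting accumulator
# prints:
# '' + 'a'
# 'A' + 'b'
# 'AB' + 'c'
print(result)                       # ABC

# The same result with a generator expression and join.
joined = ''.join(ch.upper() for ch in 'abc')
print(joined == result)             # True
```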

Numpy and Pandas

Unlike R, Python was not designed for statistical analysis; it was designed as a general-purpose high-level programming language. However, one of Python's strongest features is a truly vast collection of easy-to-use libraries (packages and modules) that drastically simplify our lives.

Two key pieces of core R functionality are lacking. We do not have an analogue of vectors (efficient lists containing only one type of element), so we are also lacking matrices and tensors, which are just fancier vectors. We are also lacking the data frame abstraction, which plays a central role in R.

Vector functionality comes from numpy, which is usually imported as np. It provides fast vectors and vectorized operations and should be used instead of lists for numerical data whenever possible. Data frames come from pandas, which is usually imported as pd. Pandas builds on numpy and is part of the scipy ecosystem, which includes many numerical libraries, among them more advanced statistics and linear algebra functions. The scipy ecosystem also includes matplotlib, a fairly complex but flexible plotting library. I should also mention scikit-learn, a standard machine learning library (although surprisingly limited) that is built on scipy.

In [78]:
import numpy as np
In [79]:
a=np.arange(10)
In [80]:
np.sin(a) # vectorized operation
Out[80]:
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
       -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849])

A useful numpy feature (although it takes some getting used to) is broadcasting, which is similar to vector recycling in R: when an operation combines arrays of different shapes, numpy automatically stretches the smaller array to match the larger one according to a fixed set of rules. Broadcasting can easily lead to bugs and confusion, so try to be careful.
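
Here is a minimal sketch of the kind of surprise broadcasting can produce. The arrays row and col are invented for this example: adding a shape (3,) array to a shape (3, 1) array silently gives a 3 by 3 table rather than an error, while shapes that cannot be matched do raise.

```python
import numpy as np

row = np.arange(3)            # shape (3,)
col = row.reshape(3, 1)       # shape (3, 1)

# Shapes (3,) and (3, 1) are both stretched to (3, 3) before adding.
table = row + col
print(table.shape)            # (3, 3)
print(table)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

# Lengths 3 and 4 cannot be matched up, so numpy raises a ValueError.
try:
    np.arange(3) + np.arange(4)
    compatible = True
except ValueError:
    compatible = False
print(compatible)             # False
```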

In [81]:
a*2
Out[81]:
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
In [82]:
list(range(10))*2
Out[82]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [83]:
a*a
Out[83]:
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
In [84]:
a.a
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-84-30d72b897685> in <module>()
----> 1 a.a

AttributeError: 'numpy.ndarray' object has no attribute 'a'
In [85]:
a.shape
Out[85]:
(10,)
In [86]:
b=a.reshape(10,1)
In [87]:
b
Out[87]:
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
In [88]:
b.T
Out[88]:
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [89]:
b.T.shape
Out[89]:
(1, 10)
In [90]:
c=np.dot(a,b); c
Out[90]:
array([285])
In [91]:
c.shape
Out[91]:
(1,)
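
Since a.a is not how numpy spells a dot product, here is a short sketch of the spellings that do work. The names v and col are illustrative; the @ operator assumes Python 3.5 or later.

```python
import numpy as np

v = np.arange(10)

# Three ways to take the dot product of a 1-D array with itself.
print(v @ v)                  # 285
print(v.dot(v))               # 285
print(np.dot(v, v))           # 285

# With a (10, 1) column on the right, the result keeps one axis.
col = v.reshape(10, 1)
print((v @ col).shape)        # (1,)

# The outer product gives the full 10 x 10 table of products.
print(np.outer(v, v).shape)   # (10, 10)
```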
In [92]:
d=np.zeros(shape=(2,3)); d
Out[92]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
In [93]:
e = np.ones_like(d); e
Out[93]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
In [94]:
f = np.arange(24).reshape(2, 3, 4) # the integers 0..23 arranged as a 2x3x4 array
In [95]:
f
Out[95]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
In [96]:
f[1,2,3]
Out[96]:
23
In [97]:
f[1,1:3,3]
Out[97]:
array([19, 23])
In [98]:
f[:,1:3,3]
Out[98]:
array([[ 7, 11],
       [19, 23]])
In [99]:
for x in f:
    print(x)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]
In [100]:
for outer in f:
    for inner in outer:
        for really_inner in inner:
            print(really_inner)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
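
The triple loop above can be avoided when all we want is every element in order. A hedged sketch using the flat iterator and np.ndenumerate, both standard numpy; the array cube is a stand-in built the same way as f.

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)

# flat walks over every element in row-major order, like the nested loops.
flat_values = [int(v) for v in cube.flat]
print(flat_values == list(range(24)))   # True

# ndenumerate also yields the index of each element.
last_index, last_value = list(np.ndenumerate(cube))[-1]
print(last_index, last_value)           # (1, 2, 3) 23
```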
In [101]:
import pandas as pd
In [102]:
df = pd.read_csv("crypto-markets.csv")
In [103]:
df.head()
Out[103]:
      slug  symbol     name        date  ranknow    open    high     low   close  volume      market  close_ratio  spread
0  bitcoin     BTC  Bitcoin  2013-04-28        1  135.30  135.98  132.10  134.21       0  1500520000       0.5438    3.88
1  bitcoin     BTC  Bitcoin  2013-04-29        1  134.44  147.49  134.00  144.54       0  1491160000       0.7813   13.49
2  bitcoin     BTC  Bitcoin  2013-04-30        1  144.00  146.93  134.05  139.00       0  1597780000       0.3843   12.88
3  bitcoin     BTC  Bitcoin  2013-05-01        1  139.00  139.89  107.72  116.99       0  1542820000       0.2882   32.17
4  bitcoin     BTC  Bitcoin  2013-05-02        1  116.38  125.60   92.28  105.21       0  1292190000       0.3881   33.32
In [104]:
df.symbol.unique()
Out[104]:
array(['BTC', 'ETH', 'XRP', ..., '9COIN', 'BT1', 'BT2'], dtype=object)
In [105]:
len(df.symbol.unique())
Out[105]:
1369
In [106]:
df['symbol'].unique()
Out[106]:
array(['BTC', 'ETH', 'XRP', ..., '9COIN', 'BT1', 'BT2'], dtype=object)
In [107]:
small_df = df.head(25)
In [108]:
small_df
Out[108]:
       slug  symbol     name        date  ranknow    open    high     low   close  volume      market  close_ratio  spread
 0  bitcoin     BTC  Bitcoin  2013-04-28        1  135.30  135.98  132.10  134.21       0  1500520000       0.5438    3.88
 1  bitcoin     BTC  Bitcoin  2013-04-29        1  134.44  147.49  134.00  144.54       0  1491160000       0.7813   13.49
 2  bitcoin     BTC  Bitcoin  2013-04-30        1  144.00  146.93  134.05  139.00       0  1597780000       0.3843   12.88
 3  bitcoin     BTC  Bitcoin  2013-05-01        1  139.00  139.89  107.72  116.99       0  1542820000       0.2882   32.17
 4  bitcoin     BTC  Bitcoin  2013-05-02        1  116.38  125.60   92.28  105.21       0  1292190000       0.3881   33.32
 5  bitcoin     BTC  Bitcoin  2013-05-03        1  106.25  108.13   79.10   97.75       0  1180070000       0.6424   29.03
 6  bitcoin     BTC  Bitcoin  2013-05-04        1   98.10  115.00   92.50  112.50       0  1089890000       0.8889   22.50
 7  bitcoin     BTC  Bitcoin  2013-05-05        1  112.90  118.80  107.14  115.91       0  1254760000       0.7521   11.66
 8  bitcoin     BTC  Bitcoin  2013-05-06        1  115.98  124.66  106.64  112.30       0  1289470000       0.3141   18.02
 9  bitcoin     BTC  Bitcoin  2013-05-07        1  112.25  113.44   97.70  111.50       0  1248470000       0.8767   15.74
10  bitcoin     BTC  Bitcoin  2013-05-08        1  109.60  115.78  109.60  113.57       0  1219450000       0.6424    6.18
11  bitcoin     BTC  Bitcoin  2013-05-09        1  113.20  113.46  109.26  112.67       0  1259980000       0.8119    4.20
12  bitcoin     BTC  Bitcoin  2013-05-10        1  112.80  122.00  111.55  117.20       0  1255970000       0.5407   10.45
13  bitcoin     BTC  Bitcoin  2013-05-11        1  117.70  118.68  113.01  115.24       0  1311050000       0.3933    5.67
14  bitcoin     BTC  Bitcoin  2013-05-12        1  115.64  117.45  113.44  115.00       0  1288630000       0.3890    4.01
15  bitcoin     BTC  Bitcoin  2013-05-13        1  114.82  118.70  114.50  117.98       0  1279980000       0.8286    4.20
16  bitcoin     BTC  Bitcoin  2013-05-14        1  117.98  119.80  110.25  111.50       0  1315720000       0.1309    9.55
17  bitcoin     BTC  Bitcoin  2013-05-15        1  111.40  115.81  103.50  114.22       0  1242760000       0.8708   12.31
18  bitcoin     BTC  Bitcoin  2013-05-16        1  114.22  118.76  112.20  118.76       0  1274620000       1.0000    6.56
19  bitcoin     BTC  Bitcoin  2013-05-17        1  118.21  125.30  116.57  123.02       0  1319590000       0.7388    8.73
20  bitcoin     BTC  Bitcoin  2013-05-18        1  123.50  125.25  122.30  123.50       0  1379140000       0.4068    2.95
21  bitcoin     BTC  Bitcoin  2013-05-19        1  123.21  124.50  119.57  121.99       0  1376370000       0.4909    4.93
22  bitcoin     BTC  Bitcoin  2013-05-20        1  122.50  123.62  120.12  122.00       0  1368910000       0.5371    3.50
23  bitcoin     BTC  Bitcoin  2013-05-21        1  122.02  123.00  121.21  122.88       0  1363940000       0.9330    1.79
24  bitcoin     BTC  Bitcoin  2013-05-22        1  122.89  124.00  122.00  123.89       0  1374130000       0.9450    2.00
In [109]:
small_df[['date', 'close']]
Out[109]:
          date   close
 0  2013-04-28  134.21
 1  2013-04-29  144.54
 2  2013-04-30  139.00
 3  2013-05-01  116.99
 4  2013-05-02  105.21
 5  2013-05-03   97.75
 6  2013-05-04  112.50
 7  2013-05-05  115.91
 8  2013-05-06  112.30
 9  2013-05-07  111.50
10  2013-05-08  113.57
11  2013-05-09  112.67
12  2013-05-10  117.20
13  2013-05-11  115.24
14  2013-05-12  115.00
15  2013-05-13  117.98
16  2013-05-14  111.50
17  2013-05-15  114.22
18  2013-05-16  118.76
19  2013-05-17  123.02
20  2013-05-18  123.50
21  2013-05-19  121.99
22  2013-05-20  122.00
23  2013-05-21  122.88
24  2013-05-22  123.89
In [110]:
small_df[4:6]
Out[110]:
      slug  symbol     name        date  ranknow    open    high     low   close  volume      market  close_ratio  spread
4  bitcoin     BTC  Bitcoin  2013-05-02        1  116.38  125.60   92.28  105.21       0  1292190000       0.3881   33.32
5  bitcoin     BTC  Bitcoin  2013-05-03        1  106.25  108.13   79.10   97.75       0  1180070000       0.6424   29.03
In [111]:
small_df[4] # fails
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 4

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-111-c83d9289599e> in <module>()
----> 1 small_df[4] # fails

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

~/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 4
In [112]:
small_df.loc[4]
Out[112]:
slug              bitcoin
symbol                BTC
name              Bitcoin
date           2013-05-02
ranknow                 1
open               116.38
high                125.6
low                 92.28
close              105.21
volume                  0
market         1292190000
close_ratio        0.3881
spread              33.32
Name: 4, dtype: object
In [113]:
small_df.loc[4,"open"]
Out[113]:
116.38
In [114]:
small_df.iloc[4,4]
Out[114]:
1

Pay attention to the syntax for referencing. Think of the loc and iloc objects as dictionaries that pull up the relevant pieces of the data frame and allow slicing notation. The difference is that loc looks rows and columns up by label, while iloc looks them up only by integer position. Note that slices with loc include both endpoints, whereas slices with iloc exclude the end, just as with lists.
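
The difference between loc and iloc is easiest to see when the row labels do not start at zero. A minimal sketch on a made-up data frame called toy (its values are invented, not taken from the crypto data):

```python
import pandas as pd

# Row labels 2..5 sit at integer positions 0..3.
toy = pd.DataFrame({'open': [10.0, 11.0, 12.0, 13.0],
                    'close': [10.5, 11.5, 12.5, 13.5]},
                   index=[2, 3, 4, 5])

# loc slices by label and includes both ends: labels 2, 3 and 4.
print(toy.loc[2:4].shape)     # (3, 2)

# iloc slices by position and excludes the end: positions 2 and 3.
print(toy.iloc[2:4].shape)    # (2, 2)

# The same cell, addressed by label and by position.
print(toy.loc[4, 'open'])     # 12.0
print(toy.iloc[2, 0])         # 12.0
```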

In [115]:
type(small_df.loc[4:4])
Out[115]:
pandas.core.frame.DataFrame
In [116]:
type(small_df.loc[4])
Out[116]:
pandas.core.series.Series
In [117]:
df['date'] = pd.to_datetime(df['date'])
In [118]:
df['date'].head()
Out[118]:
0   2013-04-28
1   2013-04-29
2   2013-04-30
3   2013-05-01
4   2013-05-02
Name: date, dtype: datetime64[ns]

Select only a few symbols (positions 1 through 4 of the unique symbols, which skips Bitcoin at position 0).

In [119]:
mask = df['symbol'].isin(df['symbol'].unique()[1:5])
trim_df = df[mask]
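
The filtering step works by building a boolean Series and using it to index the data frame. A small sketch on invented data (the frame toy_prices and the symbols AAA to DDD are hypothetical) shows what the mask looks like:

```python
import pandas as pd

toy_prices = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'BBB', 'CCC', 'CCC', 'DDD'],
    'close': [1, 2, 3, 4, 5, 6],
})

# unique keeps symbols in order of first appearance; take positions 1 and 2.
wanted = list(toy_prices['symbol'].unique()[1:3])
print(wanted)                       # ['BBB', 'CCC']

# isin gives one True/False per row.
keep = toy_prices['symbol'].isin(wanted)
print(keep.tolist())                # [False, False, True, True, True, False]

# Indexing with the boolean Series keeps only the True rows.
trimmed = toy_prices[keep]
print(trimmed['close'].tolist())    # [3, 4, 5]
```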
In [120]:
from ggplot import *
/Users/justin/anaconda/envs/py36/lib/python3.6/site-packages/ggplot/utils.py:81: FutureWarning: pandas.tslib is deprecated and will be removed in a future version.
You can access Timestamp as pandas.Timestamp
  pd.tslib.Timestamp,
/Users/justin/anaconda/envs/py36/lib/python3.6/site-packages/ggplot/stats/smoothers.py:4: FutureWarning: The pandas.lib module is deprecated and will be removed in a future version. These are private functions and can be accessed from pandas._libs.lib instead
  from pandas.lib import Timestamp
/Users/justin/anaconda/envs/py36/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
In [121]:
gg = ggplot(aes(x='date',y='close',color='symbol'),data = trim_df) + geom_line() + ggtitle("Cryptocurrency prices") + scale_y_log() + \
  scale_x_date() + ylab("Closing price (log-scale)") + xlab("Date")
In [122]:
gg.show()
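
In case the ggplot package is not available in your environment, roughly the same picture can be drawn with matplotlib directly. This is only a sketch on invented data: the frame toy_prices and the symbols AAA and BBB are hypothetical, and with the real data you would loop over trim_df instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

toy_prices = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"] * 2),
    "symbol": ["AAA"] * 3 + ["BBB"] * 3,
    "close": [1.0, 10.0, 100.0, 5.0, 50.0, 500.0],
})

fig, ax = plt.subplots()

# One line per symbol, the matplotlib counterpart of color='symbol'.
for symbol, group in toy_prices.groupby("symbol"):
    ax.plot(group["date"], group["close"], label=symbol)

ax.set_yscale("log")                       # counterpart of scale_y_log()
ax.set_title("Cryptocurrency prices")
ax.set_xlabel("Date")
ax.set_ylabel("Closing price (log-scale)")
ax.legend(title="symbol")
```

Inside a notebook you would leave out the matplotlib.use line and call plt.show() at the end.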