Tools for Analysis

A Brief Introduction to Python

This resource assumes Python is already installed, and that the user has some familiarity with Python basics such as:

  • Variables

    • Assigning variables

    • Basic variable types (lists, dicts, strings, booleans, tuples)

  • Control flow

    • for loops

    • while loops

    • if/elif/else statements

  • Logical checks

    • E.g. ==, >, <, etc

  • Functions

    • Defining functions

    • Calling functions

  • Methods and attributes

    • Calling methods and attributes

    • (We will likely not define any methods or attributes, but it will be good to understand how to use them)

  • Comments

If you need more information regarding those, see these resources:

Handling Data

NumPy

  • NumPy (Numerical Python) is an open source Python library used in many areas of science and engineering.
    It provides multidimensional array and matrix data structures, similar to those in MATLAB. NumPy is built around
    the ndarray, an n-dimensional array object with methods that operate on its data.

Using NumPy on Your Computer

  • NumPy must be both installed and imported.

    • First, to install NumPy, open a Terminal window on your computer and type the following command: pip install numpy

    • If you have Anaconda installed, you can install NumPy using the command: conda install numpy

      • Note that conda is the package manager for Anaconda, whereas pip is the standard package installer for Python. You can use pip if you do not have Anaconda installed.

    • Now that you have installed NumPy, you need to import it. To import the package, run the following statement in a Python session, script, or notebook cell (not at the Terminal prompt itself): import numpy as np. The name is shortened to np by convention, which keeps code that uses NumPy readable.
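As a quick check that the installation and import described above worked, a minimal sketch like the following can be run in a Python session. It assumes only that NumPy is installed in the active environment:

```python
import numpy as np

# If the import succeeded, NumPy exposes its version as a string.
print(np.__version__)
```

If this raises ModuleNotFoundError, NumPy is not installed in the environment that Python is running from.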

What’s an Array?

  • An array is similar to a list in Python, but its size is fixed when it is created. An array holds only as many elements as its initial size; it cannot be appended to in place the way a list can (functions such as np.append() return a new array instead). An array is a homogeneous data structure (all of its elements share the same data type). Elements are numbered consecutively, beginning with 0. An array can be multidimensional.

  • One-dimensional arrays can be indexed, sliced, and iterated over much like lists in Python. Here are some examples of one-dimensional arrays and their functionality:

import numpy as np
# This array is created from a manually written list. It has a length of 7, and each element is an int.
array_one = np.array([1, 2, 3, 4, 5, 6, 7])
print(array_one, '\n')

# This array is built from a list comprehension. The loop runs with i from 0 through 4,
# and each of the 5 elements is given the value 3.
array_two = np.array([3 for i in range(5)])
print(array_two, '\n')

# This array is created by calling NumPy's empty() function, which takes the shape of the array and
# the element type as inputs. In this case, it returns an array of length 10 whose elements are strings.
# Note that dtype=str here gives a one-character string type, so longer strings would be truncated.
array_three = np.empty(10, dtype=str)
print(array_three, '\n')

# This array is built by repeating a Python list 4 times and converting the result to an array.
# It outputs an array of length 4 with each element as the given string.
array_four = np.array(['Hello World'] * 4)
print(array_four, '\n')
[1 2 3 4 5 6 7] 

[3 3 3 3 3] 

['' '' '' '' '' '' '' '' '' ''] 

['Hello World' 'Hello World' 'Hello World' 'Hello World'] 
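The indexing, slicing, and iteration mentioned above, along with the fixed-size and single-type properties of an array, can be sketched as follows. The array name and values here are hypothetical and chosen only for illustration:

```python
import numpy as np

# Hypothetical values for illustration.
example = np.array([10, 20, 30, 40, 50])

first = example[0]       # indexing starts at 0
middle = example[1:4]    # slice containing the elements at indices 1, 2, and 3

total = 0
for value in example:    # iteration works the same way as with a list
    total += value

# All elements share one data type; kind 'i' means signed integer.
print(example.dtype.kind)

# np.append() does not grow the array in place. It returns a new, longer array
# and leaves the original unchanged.
longer = np.append(example, 60)

print(first, middle, total)
print(len(example), len(longer))
```

Because np.append() copies the data each time, calling it repeatedly in a loop is usually slower than collecting values in a list and converting once at the end.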
  • Here are some examples of how you can initialize multi-dimensional arrays:

# Initialize an array of length 10 in dimension 1 (rows) and length 3 in dimension 2 (columns).
# The array is filled with random integers from 0 up to (but not including) 100.
multi_array_one = np.random.randint(100, size=(10,3))
print(multi_array_one, '\n')

# Initialize another array of length 10 in dimension 1 and length 3 in dimension 2.
# The array is filled with random integers from 0 up to (but not including) 100, and each
# element is then multiplied by 5.
multi_array_two = np.random.randint(100, size=(10,3))*5
print(multi_array_two, '\n')

# Initialize yet another array of length 10 in dimension 1 and length 3 in dimension 2,
# except this time each element is entered manually.
multi_array_three = np.array([[55, 48, 40], [93, 58, 58],
                              [83, 57, 49], [39, 49, 34],
                              [92, 74, 20], [49, 37, 95],
                              [48, 30, 27], [60, 27, 59],
                              [29, 10, 44], [73, 93, 85]])
print(multi_array_three, '\n')
[[43 70 84]
 [16 19 61]
 [22 34 40]
 [35 81 37]
 [60 17  5]
 [ 0 93 49]
 [73 15 53]
 [47 12 85]
 [93 67 96]
 [23 95  0]] 

[[365 175  75]
 [485 475 470]
 [375   0 170]
 [220 155 465]
 [ 80 260 320]
 [435 430  30]
 [350  55 145]
 [395  50 450]
 [110 165 145]
 [110 330 275]] 

[[55 48 40]
 [93 58 58]
 [83 57 49]
 [39 49 34]
 [92 74 20]
 [49 37 95]
 [48 30 27]
 [60 27 59]
 [29 10 44]
 [73 93 85]] 
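To make the idea of "length in dimension 1" and "length in dimension 2" concrete, here is a minimal sketch that inspects the shape of a two-dimensional array and reads single elements, rows, and columns. The array named grid is hypothetical and kept small for readability:

```python
import numpy as np

# Hypothetical 3x2 array, smaller than the examples above.
grid = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])

print(grid.ndim)     # number of dimensions
print(grid.shape)    # (length in dimension 1, length in dimension 2)
print(grid[0, 1])    # element in row 0, column 1
print(grid[1])       # the whole of row 1
print(grid[:, 0])    # the whole of column 0
```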

Accessing Elements in an Array

  • Let’s say you have a large dataset and you want to extract data points below a certain value:

# Initialize a dataset. This time it's a random array of integers from 0 up to (but not including) 1000.
# The array has a size of 5x7 (5 rows and 7 columns).
data_to_analyze = np.random.randint(1000, size=(5, 7))
print(data_to_analyze, '\n')

# Set the value threshold. Only values strictly below this threshold are extracted from your dataset.
value_threshold = 500
# Initialize an empty list which will eventually contain all datapoints you want to keep. 
output_data = []

# Iterate through the dataset and append kept datapoints to the output list:
for row in data_to_analyze:
    for element in row:
        if element < value_threshold:
            output_data.append(element)
print(output_data)
[[892 857 509 818 873  20 324]
 [936 248 265 542 114 819 452]
 [926 500 489 284 614 421 305]
 [ 79 866 182 784 647 155 976]
 [297 796 189 448 647 264 463]] 

[20, 324, 248, 265, 114, 452, 489, 284, 421, 305, 79, 182, 155, 297, 189, 448, 264, 463]
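  • We can also use np.where() to get the indices where a condition is true. For a two-dimensional array it returns a tuple of two index arrays (row indices and column indices), which can be used to index the original array directly.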

indices_to_keep = np.where(data_to_analyze < value_threshold)
print(indices_to_keep, "\n")

output_data = data_to_analyze[indices_to_keep]
print(output_data, "\n")
(array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4]), array([5, 6, 1, 2, 4, 6, 2, 3, 5, 6, 0, 2, 5, 0, 2, 3, 5, 6])) 

[ 20 324 248 265 114 452 489 284 421 305  79 182 155 297 189 448 264 463] 
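The same flat result can also be obtained without computing indices, by indexing the array with the boolean condition itself. This is a minimal sketch on a hypothetical small dataset (the names sample, threshold, mask, and kept are illustrative, not from the examples above):

```python
import numpy as np

# Hypothetical small dataset for illustration.
sample = np.array([[892, 20, 324],
                   [248, 936, 500]])
threshold = 500

# Comparing an array with a number yields a boolean array of the same shape.
mask = sample < threshold

# Indexing with the boolean array returns a 1-D array of the elements where it is True.
kept = sample[mask]

print(mask)
print(kept)
```

Note that 500 is not kept, because the comparison is strictly "less than".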
  • Note that the examples above produce a flat, one-dimensional collection of the data you want to keep, so the position of each value in the original dataset is lost. Let’s say that you still want to keep only the datapoints below a given threshold, but you wish to retain the same shape as the original dataset:

print(data_to_analyze, '\n')

# Set the value threshold. Only values strictly below this threshold are kept.
value_threshold = 500
# Initialize an empty float array with the same shape as the dataset. Floats are needed because
# NaN, which marks the discarded datapoints, is a floating-point value.
output_data = np.empty((5,7), dtype=float)

# Iterate through the dataset and copy kept datapoints into the same position in the output array.
# Every other position is set to NaN.
for i, row in enumerate(data_to_analyze):
    for j, element in enumerate(row):
        if element < value_threshold:
            output_data[i,j] = element
        else:
            output_data[i,j] = np.nan
    
print(output_data)
[[892 857 509 818 873  20 324]
 [936 248 265 542 114 819 452]
 [926 500 489 284 614 421 305]
 [ 79 866 182 784 647 155 976]
 [297 796 189 448 647 264 463]] 

[[ nan  nan  nan  nan  nan  20. 324.]
 [ nan 248. 265.  nan 114.  nan 452.]
 [ nan  nan 489. 284.  nan 421. 305.]
 [ 79.  nan 182.  nan  nan 155.  nan]
 [297.  nan 189. 448.  nan 264. 463.]]
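The nested loop above can also be written as a single call, since np.where() accepts two extra arguments: the value to use where the condition is true and the value to use where it is false. The following is a minimal sketch on a hypothetical small dataset (sample, threshold, and masked are illustrative names):

```python
import numpy as np

# Hypothetical small dataset for illustration.
sample = np.array([[892, 20, 324],
                   [248, 936, 500]])
threshold = 500

# Keep the original value where the condition is True and use NaN elsewhere.
# The result is a float array because NaN is a floating-point value.
masked = np.where(sample < threshold, sample, np.nan)

print(masked)
print(masked.shape)
```

The output has the same shape as the input, which makes this form convenient when the positions of the kept values matter.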

Plotting

Matplotlib

Seaborn

Statistical analysis

SciPy

Scikit-bio

Network analysis

NetworkX