Analyzing Patient Data
|
Import a library into a program using import libraryname.
Use the numpy library to work with arrays in Python.
Use variable = value to assign a value to a variable in order to record it in memory.
Variables are created on demand whenever a value is assigned to them.
Use print(something) to display the value of something.
The expression array.shape gives the shape of an array.
Use array[x, y] to select a single element from an array.
Array indices start at 0, not 1.
Use low:high to specify a slice that includes the indices from low to high-1.
All the indexing and slicing that works on arrays also works on strings.
Use # some kind of explanation to add comments to programs.
Use numpy.mean(array), numpy.max(array), and numpy.min(array) to calculate simple statistics.
Use numpy.mean(array, axis=0) or numpy.mean(array, axis=1) to calculate statistics across the specified axis.
Use the pyplot library from matplotlib for creating simple visualizations.
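As a minimal sketch of the array operations listed above, the following uses a small made-up array (3 hypothetical patients by 4 days) rather than any real data file:

```python
import numpy

# Hypothetical data: one row per patient, one column per day.
data = numpy.array([[0, 1, 2, 3],
                    [0, 2, 4, 6],
                    [1, 1, 1, 1]])

print(data.shape)             # (number of rows, number of columns)

first = data[0, 0]            # indices start at 0
block = data[0:2, 1:3]        # rows 0 to 1, columns 1 to 2 (high index excluded)

overall_mean = numpy.mean(data)            # one number for the whole array
day_mean = numpy.mean(data, axis=0)        # one value per column (day)
patient_max = numpy.max(data, axis=1)      # one value per row (patient)
print(patient_max)
```

With axis=0 the rows are collapsed, leaving one value per column; with axis=1 the columns are collapsed, leaving one value per row.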
|
Repeating Actions with Loops
|
Use for variable in sequence to process the elements of a sequence one at a time.
The body of a for loop must be indented.
Use len(thing) to determine the length of something that contains other values.
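A short sketch of these points, using an arbitrary example string: the loop visits one character at a time, and the count it builds up should agree with len.

```python
word = "oxygen"   # arbitrary example value

count = 0
for char in word:
    # The indented lines form the body of the loop.
    count = count + 1

print(count, len(word))
```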
|
Storing Multiple Values in Lists
|
[value1, value2, value3, ...] creates a list.
Lists are indexed and sliced in the same way as strings and arrays.
Lists are mutable (i.e., their values can be changed in place).
Strings are immutable (i.e., the characters in them cannot be changed).
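The difference between mutable lists and immutable strings can be sketched as follows (the values are arbitrary examples):

```python
odds = [1, 3, 5, 7]
first_two = odds[0:2]     # slicing works as it does for strings and arrays
odds[1] = 9               # lists can be changed in place

name = "Darwin"
try:
    name[0] = "d"         # strings cannot be changed in place
    string_changed = True
except TypeError:
    string_changed = False

print(odds, first_two, name, string_changed)
```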
|
Analyzing Data from Multiple Files
|
Use glob.glob(pattern) to create a list of files whose names match a pattern.
Use * in a pattern to match zero or more characters, and ? to match any single character.
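To illustrate the two wildcards, this sketch creates a few empty-ish files with hypothetical names in a temporary folder and then matches them. The file names are assumptions made for the example only.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file names, created only for this demonstration.
    for name in ["inflammation-01.csv", "inflammation-02.csv",
                 "small-01.csv", "notes.txt"]:
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("0,1,2\n")

    # * matches zero or more characters.
    star_paths = glob.glob(os.path.join(folder, "*.csv"))
    star_matches = sorted(os.path.basename(path) for path in star_paths)

    # ? matches exactly one character.
    question_paths = glob.glob(os.path.join(folder, "inflammation-0?.csv"))
    question_matches = sorted(os.path.basename(path) for path in question_paths)

print(star_matches)
print(question_matches)
```

glob.glob does not promise any particular order, which is why the results are sorted before use.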
|
Making Choices
|
Use if condition to start a conditional statement, elif condition to provide additional tests, and else to provide a default.
The bodies of the branches of conditional statements must be indented.
Use == to test for equality.
X and Y is only true if both X and Y are true.
X or Y is true if either X or Y, or both, are true.
Zero, the empty string, and the empty list are considered false; all other numbers, strings, and lists are considered true.
Nest loops to operate on multi-dimensional data.
Put code whose parameters change frequently in a function, then call it with different parameter values to customize its behavior.
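A small sketch of a conditional inside a loop, followed by the truth values described above (all inputs are arbitrary examples):

```python
labels = []
for number in [-3, 0, 7]:
    if number > 0:
        labels.append("positive")
    elif number == 0:
        labels.append("zero")
    else:
        labels.append("negative")

# Zero, the empty string, and the empty list count as false...
falsy = [bool(0), bool(""), bool([])]
# ...while other numbers, strings, and lists count as true.
truthy = [bool(-1), bool("0"), bool([0])]

both = (1 > 0) and (2 > 3)     # only true if both sides are true
either = (1 > 0) or (2 > 3)    # true if at least one side is true

print(labels, falsy, truthy, both, either)
```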
|
Creating Functions
|
Define a function using def name(...params...).
The body of a function must be indented.
Call a function using name(...values...).
Numbers are stored as integers or floating-point numbers.
Integer division (the // operator in Python 3) rounds the answer down to a whole number, discarding the fractional part; the / operator always produces a floating-point result.
Each time a function is called, a new stack frame is created on the call stack to hold its parameters and local variables.
Python looks for variables in the current stack frame before looking for them at the top level.
Use help(thing) to view help for something.
Put docstrings in functions to provide help for that function.
Specify default values for parameters when defining a function using name=value in the parameter list.
Arguments can be matched to parameters by position or by name, or omitted (in which case the default value is used).
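The following sketch pulls several of these points together. The function rescale is a hypothetical example written for this page, not part of any library, and it assumes the input values are not all equal.

```python
def rescale(values, low=0.0, high=1.0):
    """Linearly rescale a list of numbers to the range low..high.

    Assumes the values are not all equal (otherwise the span is zero).
    """
    smallest = min(values)
    span = max(values) - smallest
    return [low + (v - smallest) / span * (high - low) for v in values]

defaults = rescale([1, 2, 3])                 # low and high omitted
by_position = rescale([1, 2, 3], 0, 10)       # matched by position
by_name = rescale([1, 2, 3], high=100.0)      # matched by name

whole = 7 // 2       # integer division
exact = 7 / 2        # floating-point division

print(defaults, by_position, by_name, whole, exact)
# help(rescale) would display the docstring written above.
```

The names smallest and span exist only while rescale is running: they live in that call's stack frame and are not visible at the top level.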
|
Errors and Exceptions
|
Tracebacks can look intimidating, but they give us a lot of useful information about what went wrong in our program, including where the error occurred and what type of error it was.
An error having to do with the ‘grammar’ or syntax of the program is called a SyntaxError. If the issue has to do with how the code is indented, then it will be called an IndentationError.
A NameError will occur if you use a variable that has not been defined, either because you meant to use quotes around a string, you forgot to define the variable, or you just made a typo.
Containers like lists and strings will generate errors if you try to access items in them that do not exist. This type of error is called an IndexError.
Trying to read a file that does not exist will give you a FileNotFoundError. Trying to read a file that is open for writing, or writing to a file that is open for reading, will give you an IOError (an alias of OSError in Python 3).
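As a rough illustration, the sketch below deliberately triggers each kind of error and records its type. The broken code snippets and the missing file name are made up for the example; compile is used so that the syntax errors can be caught while the program runs.

```python
import os
import tempfile

observed = []

try:
    compile("print('unbalanced'", "<sketch>", "exec")    # missing parenthesis
except SyntaxError as error:
    observed.append(type(error).__name__)

try:
    compile("def f():\nreturn 1\n", "<sketch>", "exec")  # body not indented
except IndentationError as error:
    observed.append(type(error).__name__)

try:
    print(undefined_variable)                            # never assigned
except NameError as error:
    observed.append(type(error).__name__)

try:
    [1, 2, 3][5]                                         # no item at index 5
except IndexError as error:
    observed.append(type(error).__name__)

with tempfile.TemporaryDirectory() as folder:
    try:
        open(os.path.join(folder, "missing.txt"))        # file does not exist
    except FileNotFoundError as error:
        observed.append(type(error).__name__)

print(observed)
```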
|
Defensive Programming
|
Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do.
Put assertions in programs to check their state as they run, and to help readers understand how those programs are supposed to work.
Use preconditions to check that the inputs to a function are safe to use.
Use postconditions to check that the output from a function is safe to use.
Write tests before writing code in order to help determine exactly what that code is supposed to do.
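One possible shape for preconditions and postconditions is sketched below. The function normalize is a hypothetical example, and the particular checks are assumptions about what "safe" means for it.

```python
def normalize(values):
    """Scale non-negative numbers so that they add up to 1."""
    # Preconditions: check that the inputs are safe to use.
    assert len(values) > 0, "need at least one value"
    assert all(v >= 0 for v in values), "values must be non-negative"
    total = sum(values)
    assert total > 0, "values must not all be zero"

    result = [v / total for v in values]

    # Postcondition: check that the output is what we promised.
    assert abs(sum(result) - 1.0) < 1e-9, "result should sum to 1"
    return result

shares = normalize([1, 1, 2])
print(shares)
```

Writing the expected result for an input such as [1, 1, 2] before writing the function body is a small instance of writing the test first.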
|
Debugging
|
Know what code is supposed to do before trying to debug it.
Make it fail every time.
Make it fail fast.
Change one thing at a time, and for a reason.
Keep track of what you’ve done.
Be humble.
|
Command-Line Programs
|
The sys library connects a Python program to the system it is running on.
The list sys.argv contains the command-line arguments that a program was run with.
Avoid silent failures.
The pseudo-file sys.stdin connects to a program’s standard input.
The pseudo-file sys.stdout connects to a program’s standard output.
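A minimal sketch of a command-line program along these lines is shown below. The script name stats.py and the --min and --max flags are hypothetical. To keep the example self-contained, it is exercised with in-memory stand-ins for standard input and output; a real script would instead call main(sys.argv, sys.stdin, sys.stdout).

```python
import io
import sys

def main(args, source, sink):
    """Write the min or max of the numbers read from source to sink."""
    if len(args) != 2 or args[1] not in ("--min", "--max"):
        # Avoid silent failures: say what went wrong and signal an error.
        sys.stderr.write("usage: stats.py --min|--max\n")
        return 1
    values = [float(line) for line in source if line.strip()]
    if not values:
        sys.stderr.write("error: no numbers on standard input\n")
        return 1
    chooser = min if args[1] == "--min" else max
    sink.write(str(chooser(values)) + "\n")
    return 0

# Stand-ins for sys.argv, sys.stdin, and sys.stdout.
fake_output = io.StringIO()
status = main(["stats.py", "--max"], io.StringIO("3\n8\n5\n"), fake_output)
print(status, fake_output.getvalue())
```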
|
Starting With Pandas
|
|
Indexing, Slicing and Subsetting DataFrames in Python
|
|
Data Types and Formats
|
|
Combining DataFrames with pandas
|
|
Data workflows and automation
|
|
Selecting Data
|
A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.
A database manager is a program that manipulates information stored in a database.
We write queries in a specialized language called SQL to extract information from databases.
Use SELECT… FROM… to get values from a database table.
SQL is case-insensitive (but data is case-sensitive).
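The sketch below runs a SELECT query from Python using the standard sqlite3 module and an in-memory database. The Person table and its rows are hypothetical, and the behaviour shown is that of SQLite with its default settings.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Person (id TEXT, personal TEXT, family TEXT)")
connection.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [("dyer", "William", "Dyer"), ("lake", "Anderson", "Lake")],
)

# Keywords, table names, and column names are case-insensitive.
upper = connection.execute("SELECT family, personal FROM Person").fetchall()
lower = connection.execute("select FAMILY, Personal from PERSON").fetchall()

# The data itself is case-sensitive.
exact = connection.execute(
    "SELECT id FROM Person WHERE family = 'Dyer'").fetchall()
wrong_case = connection.execute(
    "SELECT id FROM Person WHERE family = 'dyer'").fetchall()

connection.close()
print(upper, exact, wrong_case)
```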
|
Sorting and Removing Duplicates
|
The records in a database table are not intrinsically ordered: if we want to display them in some order, we must specify that explicitly with ORDER BY.
The values in a database are not guaranteed to be unique: if we want to eliminate duplicates, we must specify that explicitly as well using DISTINCT.
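A brief sketch of ORDER BY and DISTINCT, again using sqlite3 with a hypothetical Survey table:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Survey (person TEXT, quant TEXT, reading REAL)")
connection.executemany(
    "INSERT INTO Survey VALUES (?, ?, ?)",
    [("dyer", "rad", 9.82), ("dyer", "sal", 0.13),
     ("lake", "sal", 0.05), ("lake", "rad", 1.46)],
)

# Each quantity appears twice in the table but once in the result.
distinct_quants = connection.execute(
    "SELECT DISTINCT quant FROM Survey ORDER BY quant").fetchall()

# Largest reading first.
by_reading = connection.execute(
    "SELECT person, reading FROM Survey ORDER BY reading DESC").fetchall()

connection.close()
print(distinct_quants)
print(by_reading)
```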
|
Filtering
|
Use WHERE to specify conditions that records must meet in order to be included in a query’s results.
Use AND, OR, and NOT to combine tests.
Filtering is done on whole records, so conditions can use fields that are not actually displayed.
Write queries incrementally.
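One way to build a filter incrementally is sketched here: each step adds one condition to the previous query and is checked before moving on. The Survey table and its values are hypothetical.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Survey (person TEXT, quant TEXT, reading REAL)")
connection.executemany(
    "INSERT INTO Survey VALUES (?, ?, ?)",
    [("dyer", "rad", 9.82), ("dyer", "sal", 0.13), ("lake", "sal", 0.05),
     ("lake", "rad", 1.46), ("roe", "sal", 41.6), ("roe", "sal", 0.09)],
)

# Step 1: filter on quant, a field that is not displayed.
query = "SELECT person, reading FROM Survey WHERE quant = 'sal'"
step1 = connection.execute(query + " ORDER BY reading").fetchall()

# Step 2: restrict to two people with OR.
query = query + " AND (person = 'lake' OR person = 'roe')"
step2 = connection.execute(query + " ORDER BY reading").fetchall()

# Step 3: exclude implausibly large readings with NOT.
query = query + " AND NOT reading > 1.0"
step3 = connection.execute(query + " ORDER BY reading").fetchall()

connection.close()
print(len(step1), len(step2), step3)
```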
|
Calculating New Values
|
|
Missing Data
|
Databases use a special value called NULL to represent missing information.
Almost all operations on NULL produce NULL.
Queries can test for NULLs using IS NULL and IS NOT NULL.
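The behaviour of NULL can be sketched with sqlite3, where NULL appears in Python as None. The Visited table and its rows are hypothetical.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Visited (id INTEGER, site TEXT, dated TEXT)")
connection.executemany(
    "INSERT INTO Visited VALUES (?, ?, ?)",
    [(619, "DR-1", "1927-02-08"), (752, "DR-3", None),
     (837, "MSK-4", "1932-01-14")],
)

# Arithmetic on NULL produces NULL.
null_arithmetic = connection.execute("SELECT NULL + 1").fetchone()[0]

# Comparing with = NULL never matches, not even for the missing date.
equals_null = connection.execute(
    "SELECT id FROM Visited WHERE dated = NULL").fetchall()

# IS NULL and IS NOT NULL are the tests that work.
is_null = connection.execute(
    "SELECT id FROM Visited WHERE dated IS NULL").fetchall()
not_null = connection.execute(
    "SELECT id FROM Visited WHERE dated IS NOT NULL ORDER BY id").fetchall()

connection.close()
print(null_arithmetic, equals_null, is_null, not_null)
```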
|
Aggregation
|
Use aggregation functions to combine multiple values.
Aggregation functions ignore NULL values.
Aggregation happens after filtering.
Use GROUP BY to combine subsets separately.
If no aggregation function is specified for a field, the query may return an arbitrary value for that field.
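A short sketch of aggregation with and without GROUP BY, using sqlite3 and a hypothetical Survey table that contains one NULL reading:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Survey (person TEXT, quant TEXT, reading REAL)")
connection.executemany(
    "INSERT INTO Survey VALUES (?, ?, ?)",
    [("dyer", "rad", 9.82), ("dyer", "rad", 7.80), ("lake", "rad", 1.46),
     ("lake", "rad", None), ("lake", "sal", 0.05)],
)

# Filtering happens first, so only the 'rad' rows are aggregated.
# COUNT(*) counts rows; COUNT(reading) skips the NULL reading.
overall = connection.execute(
    "SELECT COUNT(*), COUNT(reading), MAX(reading) "
    "FROM Survey WHERE quant = 'rad'").fetchone()

# GROUP BY aggregates each person's rows separately.
per_person = connection.execute(
    "SELECT person, COUNT(reading), MAX(reading) "
    "FROM Survey WHERE quant = 'rad' "
    "GROUP BY person ORDER BY person").fetchall()

connection.close()
print(overall)
print(per_person)
```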
|
Combining Data
|
Use JOIN to combine data from two tables.
Use table.field notation to refer to fields when doing joins.
Every fact should be represented in a database exactly once.
A join with no matching condition produces all combinations of records from one table with records from another.
A primary key is a field (or set of fields) whose values uniquely identify the records in a table.
A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.
We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.
The most common join condition is matching keys.
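The effect of a join condition can be sketched as follows. The Site and Visited tables are hypothetical; Site.name plays the role of a primary key and Visited.site the role of a foreign key. The first query relies on SQLite accepting JOIN without a condition.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Site (name TEXT PRIMARY KEY, lat REAL)")
connection.execute("CREATE TABLE Visited (id INTEGER PRIMARY KEY, site TEXT)")
connection.executemany(
    "INSERT INTO Site VALUES (?, ?)", [("DR-1", -49.85), ("MSK-4", -48.87)])
connection.executemany(
    "INSERT INTO Visited VALUES (?, ?)",
    [(619, "DR-1"), (622, "DR-1"), (837, "MSK-4")])

# Without a condition: every site paired with every visit (2 x 3 rows).
all_pairs = connection.execute(
    "SELECT COUNT(*) FROM Site JOIN Visited").fetchone()[0]

# Matching keys keeps only the meaningful combinations.
matched = connection.execute(
    "SELECT Site.name, Visited.id FROM Site "
    "JOIN Visited ON Site.name = Visited.site "
    "ORDER BY Visited.id").fetchall()

connection.close()
print(all_pairs, matched)
```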
|
Data Hygiene
|
Every value in a database should be atomic.
Every record should have a unique primary key.
A database should not contain redundant information.
Units and similar metadata should be stored with the data.
|
Creating and Modifying Data
|
Use CREATE and DROP to create and delete tables.
Use INSERT to add data.
Use UPDATE to modify existing data.
Use DELETE to remove data.
It is simpler and safer to modify data when every record has a unique primary key.
Do not create dangling references by deleting records that other records refer to.
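The statements above can be sketched end to end with sqlite3. The Site table and its values are hypothetical.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# CREATE a table whose records each have a unique primary key.
connection.execute("CREATE TABLE Site (name TEXT PRIMARY KEY, lat REAL)")

# INSERT adds records.
connection.executemany(
    "INSERT INTO Site VALUES (?, ?)", [("DR-1", -49.85), ("MSK-4", -48.87)])

# UPDATE and DELETE use the primary key to target exactly one record.
connection.execute("UPDATE Site SET lat = -47.87 WHERE name = 'MSK-4'")
connection.execute("DELETE FROM Site WHERE name = 'DR-1'")
remaining = connection.execute("SELECT name, lat FROM Site").fetchall()

# DROP removes the table itself.
connection.execute("DROP TABLE Site")
tables = connection.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()

connection.close()
print(remaining, tables)
```

If another table referred to DR-1, deleting that record would leave a dangling reference, so in practice the referring records would need to be handled first.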
|