Galvanize Data Science Immersive Program
Data Science Immersive Program (January 2017)
The following are the standards of Galvanize Inc.’s 3-month, full-time immersive program. They reflect the body of knowledge obtained, though more time and work were needed for proficiency in application.
What is a Standard?
Standards are the core competencies of data scientists: the knowledge, skills, and habits every Galvanize graduate should possess. They were carefully crafted in a joint effort by your lead instructors, and represent the knowledge, skills, and habits we believe you need to get your foot in the door and be successful in industry.
Standards by Topic
 Python
 Explain the difference between mutable and immutable types and their relationship to dictionaries.
 Compare the strengths and weaknesses of lists vs. dictionaries.
 Choose the appropriate collection (dict, Counter, defaultdict) to simplify a problem.
 Compare the strengths and weaknesses of lists vs. generators.
 Write Pythonic code.
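As a minimal sketch of the collection standards above (the word list and the `is_hashable` helper are hypothetical, chosen only for illustration), the following shows `Counter` and `defaultdict` replacing manual bookkeeping, why only hashable (typically immutable) objects can serve as dictionary keys, and how a generator differs from a list by being consumed after one pass.

```python
from collections import Counter, defaultdict

words = ["spam", "eggs", "spam", "ham", "spam"]  # hypothetical data

# Counter tallies hashable items without manual "if key in dict" checks.
counts = Counter(words)

# defaultdict(list) creates the empty list on first access to a new key.
by_length = defaultdict(list)
for word in words:
    by_length[len(word)].append(word)


def is_hashable(obj):
    """Return True if obj can be used as a dictionary key."""
    try:
        hash(obj)
        return True
    except TypeError:
        return False


# A generator yields values lazily and is exhausted after one pass;
# a list stores every value and can be iterated repeatedly.
squares_gen = (n * n for n in range(5))
first_pass = sum(squares_gen)   # consumes the generator
second_pass = sum(squares_gen)  # nothing left to consume
```

A tuple such as `(1, 2)` passes `is_hashable`, while a list such as `[1, 2]` does not, which is why lists cannot be dictionary keys.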
 Version Control / Git
 Explain the basic function and purpose of version control.

Use a basic Git workflow to track project changes over time, share code, and write useful commit messages.
 OOP
 Given the code for a Python class, instantiate a Python object, call its methods, and list its attributes.
 Write the Python code for a simple class.
 Match key “magic” methods to their functionality.
 Design a program or algorithm in object oriented fashion.

Compare and contrast functional and object oriented programming.
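The class standards above can be illustrated with a small sketch. The `Vector2D` class is hypothetical and exists only to show how a few common "magic" methods map to built-in syntax.

```python
class Vector2D:
    """A small 2-D vector used to illustrate common magic methods."""

    def __init__(self, x, y):      # called on instantiation
        self.x = x
        self.y = y

    def __repr__(self):            # used by repr() and the interactive prompt
        return f"Vector2D({self.x}, {self.y})"

    def __eq__(self, other):       # used by ==
        return isinstance(other, Vector2D) and (self.x, self.y) == (other.x, other.y)

    def __add__(self, other):      # used by +
        return Vector2D(self.x + other.x, self.y + other.y)

    def __abs__(self):             # used by abs()
        return (self.x ** 2 + self.y ** 2) ** 0.5


v = Vector2D(3, 4)            # instantiate an object
w = v + Vector2D(1, 1)        # calls __add__
attributes = vars(v)          # instance attributes as a dict
```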
 SQL
 Connect to a SQL database via the command line (e.g., Postgres).
 Connect to a database from within a Python program.
 State the function of basic SQL commands.
 Write simple queries on a single table including SELECT, FROM, WHERE, CASE clauses and aggregates.
 Write complex queries including JOINS and subqueries.
 Explain how indexing works in Postgres.
 Create and dump tables.
 Format a query to follow a standard style.

Move data from a SQL database to a text file.
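A hedged sketch of the query standards above. It uses Python's built-in `sqlite3` module as a stand-in for Postgres so that it runs without a server, and the `users` and `orders` tables are hypothetical. A Postgres connection would go through a driver instead, but the SQL shown here is standard.

```python
import sqlite3

# sqlite3 stands in for Postgres here so the example is self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 80.0), (3, 2, 5.0);
""")

# Single-table query: WHERE filter plus a CASE expression.
cur.execute("""
    SELECT id,
           CASE WHEN amount >= 50 THEN 'large' ELSE 'small' END AS size
    FROM orders
    WHERE amount > 10
    ORDER BY id
""")
sized_orders = cur.fetchall()

# JOIN with an aggregate: total spend per user.
cur.execute("""
    SELECT u.name, SUM(o.amount) AS total
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY total DESC
""")
totals = cur.fetchall()
conn.close()
```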
 Pandas
 Explain/use the relationship between DataFrame and Series
 Know how to set and reset indexes
 Use iloc, loc, ix, and iat appropriately
 Use index alignment and know when it applies
 Use split-apply-combine methods
 Be able to read data into pandas and write data out of pandas
 Recognize problems that can probably be solved with Pandas (as opposed to writing vanilla Python functions).

Use basic DatetimeIndex functionality
 Plotting
 Describe the architecture of a matplotlib figure
 Plot in and outside of notebooks with matplotlib and seaborn
 Combine multiple datasets/categories in same plot
 Use subplots effectively
 Plot with Pandas
 Use and explain scatter_matrix output
 Use and explain a correlation heatmap
 Visualize pairwise relationships with seaborn
 Compare within-class distributions

Use matplotlib techniques with seaborn
 Visualization
 Explain the difference between exploratory and explanatory visualizations.
 Explain what a visualization is
 Don’t lie with data
 Visualize multidimensional relationships with data using position, size, color, alpha, facets.

Create an explanatory visualization that makes a relationship in data explicit.
 Workflow
 Perform basic file operations from the command line, while consulting man/help/Google if necessary.
 Get help using man (e.g., man grep)
 Perform “survival” edits using vi, emacs, nano, or pico
 Configure environment & aliases in .bashrc/.bash_profile/.profile
 Install data science stack
 Manage a process with job control
 Examine system performance and kill processes
 Work on a remote machine with ssh/scp
 State what an RE (regular expression) is and write a simple one

State the features and use cases of grep/sed/awk/cut/paste to process/clean a text file
 Probability
 Define what a random variable is.
 Explain difference between permutations and combinations.
 Recite and apply the major probability laws from memory: Bayes' rule, the law of total probability (LOTP), and the chain rule.
 Recite and apply the major random variable formulas from memory: E(X), Var(X), and Cov(X, Y).
 Describe what a joint distribution is and be able to perform a simple calculation using joint distribution.
 Define each major probability distribution and give one clear example of each
 Explain independence of two random variables and its implications with respect to probability formulas, covariance formulas, etc.
 Compute expectation of aX+bY and explain that it is a linear operator, where X and Y are random variables
 Compute variance of aX + bY
 Discuss why correlation is not causation

Describe correlation and its perils, with reference to Anscombe’s quartet
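The expectation, variance, and covariance formulas above can be checked numerically. The following is a minimal sketch under an assumed, hypothetical joint distribution of two dependent binary variables. It computes E(aX + bY) and Var(aX + bY) directly from the joint distribution and again from the formulas, using exact fractions so the two routes can be compared for equality.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X=x, Y=y) for two dependent binary variables.
joint = {(0, 0): F(2, 5), (0, 1): F(1, 10), (1, 0): F(1, 10), (1, 1): F(2, 5)}


def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
cov_xy = expect(lambda x, y: (x - ex) * (y - ey))

a, b = 2, 3
# Direct computation from the distribution of aX + bY.
mean_direct = expect(lambda x, y: a * x + b * y)
var_direct = expect(lambda x, y: (a * x + b * y - mean_direct) ** 2)
# Formulas: E(aX+bY) = aE(X) + bE(Y)
#           Var(aX+bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)
mean_formula = a * ex + b * ey
var_formula = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy
```

Because X and Y are not independent here, the covariance term is nonzero. Dropping it would give the wrong variance.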
 Sampling
 Compute the MLE for a simple example (such as coin flipping)
 Write pseudocode for bootstrapping a given sample of size N.
 Construct a confidence interval for a case where parametric construction does not work
 Discuss examples of times when you need bootstrapping.
 Define the Central Limit Theorem
 Compute standard error

Compare and contrast the use cases of parametric and nonparametric estimation
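The bootstrapping standards above can be sketched as follows. This is one common variant (the percentile bootstrap), shown for the median, where a parametric interval is awkward. The function name, defaults, and simulated sample are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])  # resample with replacement
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


rng = random.Random(42)
data = [rng.expovariate(1.0) for _ in range(60)]  # hypothetical skewed sample
low, high = bootstrap_ci(data)
```

Each resample has the same size N as the original sample and is drawn with replacement. The interval is read off the sorted resampled statistics.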
 Hypothesis Testing
 Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for the difference of means or proportions.
 Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for the chi-square test of independence.
 Describe a situation in which a one-tailed test would be appropriate (vs. a two-tailed test).
 Given a particular situation, correctly choose among the following options: z-test, t-test, two-sample t-test (one-sided and two-sided), and two-sample z-test (one-sided and two-sided).
 Define p-value, Type I error, Type II error, and significance level, and discuss their significance in an example problem.
 Account for the multiple comparisons problem via Bonferroni correction.
 Compute the distribution of the difference of two independent normal random variables.

Discuss when to use an A/B test to evaluate the efficacy of a treatment
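A minimal sketch of one of the tests listed above: a two-sided, two-sample z-test for a difference of proportions, followed by a Bonferroni-adjusted decision. The conversion counts and the number of comparisons are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for H0: p_a == p_b using the pooled proportion."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B data: 200/1000 conversions vs. 250/1000.
z, p_value = two_proportion_ztest(200, 1000, 250, 1000)

# Bonferroni: with m comparisons, test each one at alpha / m.
alpha, m = 0.05, 5
reject = p_value < alpha / m
```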
 Power
 Define Power and relate it to the Type II error.
 Compute power given a dataset and a problem.
 Explain how the following factors contribute to power: sample size, effect size (the difference between the sample statistic and the statistic formulated under the null), and significance level.
 Identify what can be done to increase power.
 Estimate the sample size required for a test (power analysis) in the one-sample mean or proportion case
 Solve by hand for the posterior distribution for a uniform prior based on coin flips.
 Solve Discrete Bayes problem with some data
 What is the difference between Bayesian and Frequentist inference, with respect to fixed parameters and prior beliefs?
 Define power: be able to draw the picture with two normal curves with different means and highlight the section that represents power.

Explain the trade-off between significance and power
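The relationship between power, effect size, sample size, and significance level can be sketched for the simplest case: a one-sided, one-sample z-test with known standard deviation. That setting, and the function names below, are assumptions chosen to keep the example short.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 vs. H1: mu = mu0 + effect."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)      # rejection threshold under H0
    shift = effect * sqrt(n) / sigma            # mean of the test statistic under H1
    return 1 - STD_NORMAL.cdf(z_crit - shift)   # P(reject H0 | H1) = 1 - beta


def sample_size_one_sided_z(effect, sigma, alpha=0.05, power=0.8):
    """Sample size needed to reach the target power (known-sigma assumption)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / effect) ** 2)
```

Raising `n`, `effect`, or `alpha` raises the power, which is the trade-off between significance and power described above.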
 Multi-Armed Bandit
 Explain the difference between a frequentist A/B test and a Bayesian A/B test.
 Define and explain prior, likelihood, and posterior.
 Explain what a conjugate prior is and how it applies to A/B testing.
 Analyze an A/B test with the Bayesian approach.
 Explain how the multi-armed bandit addresses the tradeoff between exploitation and exploration, and the relationship to regret.

Write pseudocode for the multi-armed bandit algorithm.
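There are several bandit strategies (epsilon-greedy, UCB1, and others). The sketch below shows one of them, Thompson sampling, because it ties the bandit to the Bayesian ideas above: with a Beta(1, 1) prior and Bernoulli rewards, the posterior after w wins and l losses is Beta(1 + w, 1 + l). The true conversion rates are hypothetical and are used only to simulate rewards.

```python
import random


def thompson_sampling(true_rates, n_rounds=2000, seed=0):
    """Bernoulli bandit with a Beta(1, 1) prior on each arm's success rate."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(n_rounds):
        # Draw one plausible rate per arm from its Beta posterior ...
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))          # ... and play the best draw.
        if rng.random() < true_rates[arm]:     # simulate the reward
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


wins, losses = thompson_sampling([0.1, 0.5, 0.8])  # hypothetical conversion rates
pulls = [w + l for w, l in zip(wins, losses)]
```

Arms with uncertain posteriors still get sampled occasionally (exploration), while arms with high estimated rates get played most often (exploitation).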
 Linear Algebra in Python
 Perform basic Linear Algebra operations by hand: Multiply matrices, subtract matrices, Transpose matrices, verify inverses.

Perform linear algebra operations (multiply matrices, transpose matrices, and invert matrices) in numpy.
 Exploratory Data Analysis (EDA)
 Define EDA in your own words.
 Identify the key questions of EDA.

Perform EDA on a dataset.
 Linear Regression
 State and troubleshoot the assumptions of the linear regression model. Describe, interpret, and visualize the model form of linear regression: Y = B0 + B1X1 + B2X2 + ….
 Relate the Beta vector solution of Ordinary Least Squares to the cost function (residual sum of squares)
 Perform ordinary least squares (OLS) with statsmodels and interpret the output: Beta coefficients, p-values, R^2, adjusted R^2, AIC, BIC
 Explain how to incorporate interactions and categorical variables into linear regression

Explain how one can detect outliers
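To make the link between the OLS solution and the residual sum of squares concrete, here is a hand-rolled sketch for the single-predictor case. It is not the statsmodels workflow named above, and the data points are hypothetical.

```python
def ols_simple(xs, ys):
    """Closed-form OLS for y = b0 + b1 * x (minimizes the residual sum of squares)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def rss(xs, ys, b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]  # hypothetical, roughly y = 1 + 2x
b0, b1 = ols_simple(xs, ys)
best = rss(xs, ys, b0, b1)
```

Nudging either coefficient away from the fitted values increases the RSS, which is what it means for the OLS solution to minimize the cost function.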
 Cross Validation & Regularized Linear Regression
 Perform (one-fold) cross-validation on a dataset (train-test splitting)
 Algorithmically, explain k-fold cross-validation
 Give the reasoning for using k-fold cross-validation
 Given one full model and one regularized model, name 2 appropriate ways to compare the two models. Name 1 inappropriate way.
 Generally, when we increase flexibility or complexity of model, what happens to bias? variance? training error? test error?
 Compare and contrast Lasso and Ridge regression.
 What happens to Bias and Variance as we change the following factors: sample size, number of parameters, etc.
 What is the cost function for Ridge? for Lasso?
 Build a test error curve for Ridge regression, while varying the alpha parameter, to determine the optimal level of regularization

Build and interpret Learning curves for two learning algorithms, one that is overfit (high variance, low bias) and one that is underfit (low variance, high bias)
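The k-fold procedure above can be sketched as an index splitter. The function name and fold construction are illustrative assumptions, and libraries such as scikit-learn provide their own splitters. The key property is that every sample lands in the test fold exactly once.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=3))
```

A model would be fit on each `train_idx` and scored on the matching `test_idx`, and the k scores would then be averaged.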
 Logistic Regression
 Place logistic regression in the taxonomy of ML algorithms
 Fit and interpret a logistic regression model in scikit-learn
 Interpret the coefficients of logistic regression, using odds ratio
 Explain ROC curves

Explain the key differences and similarities between logistic and linear regression.
 Gradient Descent
 Identify and justify use cases for and failure modes of gradient descent.
 Write pseudocode of the gradient descent and stochastic gradient descent algorithms.

Compare and contrast batch and stochastic gradient descent: the algorithms, costs, and benefits.
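A minimal sketch contrasting the two algorithms on a one-feature least-squares problem. The learning rates, iteration counts, and noise-free data are assumptions picked so that both variants converge. On noisy data, stochastic gradient descent with a fixed learning rate would hover around the minimum rather than settle on it.

```python
import random


def gradient(b0, b1, points):
    """Gradient of mean squared error for y = b0 + b1 * x over the given points."""
    n = len(points)
    g0 = sum(2 * (b0 + b1 * x - y) for x, y in points) / n
    g1 = sum(2 * (b0 + b1 * x - y) * x for x, y in points) / n
    return g0, g1


def batch_gd(points, lr=0.05, n_iters=2000):
    b0 = b1 = 0.0
    for _ in range(n_iters):
        g0, g1 = gradient(b0, b1, points)          # uses every point per step
        b0, b1 = b0 - lr * g0, b1 - lr * g1
    return b0, b1


def stochastic_gd(points, lr=0.01, n_epochs=2000, seed=0):
    rng = random.Random(seed)
    points = list(points)
    b0 = b1 = 0.0
    for _ in range(n_epochs):
        rng.shuffle(points)                        # visit points in random order
        for point in points:
            g0, g1 = gradient(b0, b1, [point])     # uses one point per step
            b0, b1 = b0 - lr * g0, b1 - lr * g1
    return b0, b1


data = [(x, 1 + 2 * x) for x in range(5)]  # noise-free points on y = 1 + 2x
```

Each batch step costs a full pass over the data, while each stochastic step costs one point. That difference is the main cost and benefit trade-off between the two.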
 Decision Trees
 Thoroughly explain the construction of a decision tree (classification or regression), including selecting an impurity measure (gini, entropy, variance)
 Recognize overfitting and explain pre/post pruning and why it helps.
 Pick the ‘best’ tree via cross-validation, for a given data set.

Discuss pros and cons
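The core step of tree construction, choosing a split that minimizes impurity, can be sketched as follows. It shows Gini and entropy and a search over thresholds for a single numeric feature. The helper names are hypothetical, and a full tree would apply this step recursively to each child node.

```python
from collections import Counter
from math import log2


def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(xs, labels, impurity=gini):
    """Threshold on one numeric feature that minimizes weighted child impurity."""
    best = (float("inf"), None)
    for t in sorted(set(xs))[:-1]:               # candidate split points
        left = [l for x, l in zip(xs, labels) if x <= t]
        right = [l for x, l in zip(xs, labels) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
        best = min(best, (score, t))
    return best  # (weighted impurity, threshold)
```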
 k-Nearest Neighbors (kNN)
 Write pseudocode for the kNN algorithm from scratch
 State differences between kNN regression and classification

Discuss Pros and Cons of kNN
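A from-scratch sketch of kNN covering both classification and regression. Euclidean distance and simple majority voting are assumptions here. Other distance metrics and weighted votes are equally valid choices.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3, regression=False):
    """Predict for one query point from its k nearest training points."""
    # Sort training points by Euclidean distance to the query and keep k.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    targets = [y for _, y in neighbors]
    if regression:
        return sum(targets) / len(targets)         # average of neighbor targets
    return Counter(targets).most_common(1)[0][0]   # majority vote
```

The only difference between the two modes is the final aggregation: a vote over class labels versus an average of numeric targets.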
 Random Forest
 Thoroughly explain the construction of a random forest (classification or regression) algorithm
 Explain the relationship and difference between random forest and bagging.
 Explain why random forests are more accurate than a single decision tree.
 Explain how to get feature importances from a random forest using an algorithm

How is OOB error calculated and what is it an estimate of?
 Boosted Trees
 Define boosting in your own words.
 Be able to interpret boosting output
 List advantages and disadvantages of boosting.
 Compare and contrast boosting with other ensemble methods
 Explain each of the tuning parameters and specifically how they affect the model
 Learn, tune, and score a model using scikit-learn’s boosting class

Implement AdaBoost
 Support Vector Machines (SVM)
 Compute a hyperplane as a decision boundary in SVC
 Explain what a support vector is in plain English
 Recognize that preprocessing, specifically making sure all predictors are on the same scale, is a necessary step
 Explain SVC using the hyperparameter, C
 Tune an SVM with an RBF kernel using both hyperparameters C and gamma
 Tune an SVM with a polynomial kernel using both hyperparameters C and degree
 Describe why, generally speaking, an SVM with an RBF kernel is more likely to perform well on “tall” data as opposed to “wide” data.
 For SVMs with an RBF kernel, state what happens to bias and variance as we increase the hyperparameter “C”. State what happens to bias and variance as we increase the hyperparameter “gamma”.
 State how the “one-vs-one” and “one-vs-rest” approaches for multiclass problems are implemented.

Describe the kernel trick, and be able to calculate as if working in a high-dimensional space.
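A small worked example of the kernel trick, assuming a degree-2 homogeneous polynomial kernel on 2-D inputs (chosen because its explicit feature map is short enough to write out). The kernel value computed in the original space equals the inner product in the mapped 3-D space.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed in the original 2-D space."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map into 3-D for which phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return (x1 ** 2, sqrt(2) * x1 * x2, x2 ** 2)


x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, z)        # never builds the 3-D vectors
explicit = dot(phi(x), phi(z))       # same value via the explicit mapping
```

For kernels such as the RBF, the mapped space is infinite-dimensional, so the implicit route is the only practical one.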
 Profit Curves
 Describe the issues with imbalanced classes.
 Explain the profit curve method for thresholding.
 Explain sampling methods and give examples of sampling methods.
 Explain how sampling methods deal with imbalanced classes.

Explain cost sensitive learning and how it deals with imbalanced classes.
 Webscraping
 Compare and contrast SQL and NoSQL.
 Complete basic operations with mongo.
 Explain the basic concepts of HTML.
 Write Python code to pull out an element from a web page.

Fetch data from an existing API
 Naive Bayes
 Derive the Naive Bayes algorithm and discuss its assumptions.
 Contrast generative and discriminative models.

Discuss the pros and cons of Naive Bayes.
 NLP
 Identify and explain ways of featurizing text.
 List and explain distance metrics used in document classification.

Featurize a text corpus in Python using nltk and scikit-learn.
 Clustering
 List the characteristics of a dataset necessary to perform k-means
 Detail the k-means algorithm in steps, commenting on convergence or lack thereof.
 Use the elbow method to determine K and evaluate the choice
 Interpret Silhouette plot
 Interpret clusters by examining cluster centers, and exploring the data within each cluster (dataframe inspection, plotting, decision trees for cluster membership)
 Build and interpret a dendrogram using hierarchical clustering.

Compare and contrast k-means and hierarchical clustering.
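The steps of k-means can be sketched as below. This is Lloyd's algorithm with random initialization from the data points, which is one common choice. The function signature is an assumption, and the result can depend on the initial centroids, which is why convergence to a good solution is not guaranteed.

```python
import random
from math import dist


def k_means(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # initialize from the data
    assignments = None
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:            # converged: nothing moved
            break
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                               # keep old centroid if cluster is empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignments
```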
 Churn Case Study
 List and explain the steps in CRISP-DM (Cross-Industry Standard Process for Data Mining)
 Perform EDA standards on case study including visualizations
 Discuss the ramifications of deleting missing values when the data are MAR (missing at random), MCAR (missing completely at random), or MNAR (missing not at random)
 Explain how to impute missing values using at least two different methods, and list the pros and cons of each method
 Explain when dropping rows is okay and when dropping features is okay
 Be able to perform the feature engineering process
 Be able to identify target leak, and explain why this happens

State appropriate business goal and evaluation metric
 Dimensionality Reduction
 List reasons for reducing the dimensions.
 Describe how the principal components are constructed in PCA.
 Interpret the principal components of PCA.
 Determine how many principal components to keep.
 Describe the relationship between PCA and SVD.
 Compute and interpret PCA using sklearn.

Memorize the eigenvalue equation
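For reference, the eigenvalue equation and its link to PCA and SVD can be written out as follows. The notation is an assumption: X is the column-centered data matrix with n rows, C is its sample covariance matrix, and X = U Σ Vᵀ is its singular value decomposition.

```latex
% Eigenvalue equation: v is an eigenvector of A with eigenvalue \lambda.
A v = \lambda v

% PCA via the covariance matrix of centered data X (n rows):
C = \frac{1}{n-1} X^{\top} X, \qquad C v_i = \lambda_i v_i

% Relationship to the SVD X = U \Sigma V^{\top}:
C = V \, \frac{\Sigma^{2}}{n-1} \, V^{\top}, \qquad \lambda_i = \frac{\sigma_i^{2}}{n-1}
```

The principal directions are the columns of V, and the share of variance explained by component i is λ_i divided by the sum of all eigenvalues.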
 NMF
 Write down and explain the NMF equation.
 Compare and contrast NMF, SVD, PCA, and k-means
 Implement the Alternating Least Squares algorithm for NMF
 Find and interpret latent topics in a corpus of documents with NMF
 Explain how to interpret the H matrix and the W matrix

Explain regularization in the context of NMF.
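As a reference for the standards above, one common way to write the NMF problem is shown below. The orientation is an assumption: V is the m × n non-negative data matrix (for example, documents by terms), W is m × k, and H is k × n. The L2 penalty with weight α is one typical form of regularization, not the only one.

```latex
% Factorization with non-negativity constraints:
V \approx W H, \qquad W \ge 0, \; H \ge 0

% Objective minimized by alternating updates of W and H:
\min_{W \ge 0,\, H \ge 0} \; \lVert V - W H \rVert_F^{2}

% With an (assumed) L2 penalty on both factors:
\min_{W \ge 0,\, H \ge 0} \; \lVert V - W H \rVert_F^{2}
  + \alpha \left( \lVert W \rVert_F^{2} + \lVert H \rVert_F^{2} \right)
```

Under this orientation, each row of H describes a latent topic in terms of the original features, and each row of W gives one observation's weights on those topics.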
 Recommender Systems
 Survey approaches to recommenders, their pros & cons, and when each is likely to be best.
 Describe the cold start problem and know how it affects different recommendation strategies
 Explain either the collaborative filtering algorithm or the matrix factorization recommender algorithm.
 Discuss recommender evaluation.

Discuss performance concerns for recommenders.
 Graphs
 Define a graph and discuss the implementation.
 List common applications of graph models.
 Discuss the searching algorithms and applications of them.
 Explain the various ways of measuring the importance of a node.
 Explain methods and applications of clustering on a graph.
 Use appropriate package to build graph data structure in Python and execute common algorithms (shortest path, connected components, …)
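As an illustration of a graph data structure and a search algorithm, here is a pure-Python sketch (not one of the graph packages referred to above) of breadth-first search for a shortest path in an unweighted graph stored as an adjacency dictionary. The example graph is hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph stored as an adjacency dict."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:       # walk parent links back to start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:   # first visit is via a shortest route
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # goal not reachable


graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c", "e"], "e": ["d"]}
```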
 Cloud Computing
 Scope & Configure a data science environment on AWS.
 Protect AWS resources against unauthorized access.
 Manage AWS resources using awscli, ssh, scp, or boto3.

Monitor and control costs incurred on AWS
 Parallel Computing
 Define and contrast processes vs. threads
 Define and contrast parallelism and concurrency.
 Recognize problems that require parallelism or concurrency
 Implement parallel and concurrent solutions

Instrument approaches to see the benefit of threading/parallelism.
 Map Reduce
 Explain the types of problems that benefit from MapReduce
 Describe MapReduce, and how it relates to Hadoop
 Explain how to select the number of mappers and reducers
 Describe the role of keys in MapReduce

Perform MapReduce in Python using MRJob.
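The role of keys in MapReduce can be sketched in plain Python. This is a single-process imitation of the map, shuffle, and reduce phases for word counting, not MRJob code. The function names and input lines are assumptions.

```python
from collections import defaultdict


def mapper(line):
    """Emit (key, value) pairs: one (word, 1) per word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Combine all values that share a key."""
    return key, sum(values)


lines = ["the cat sat", "the dog sat", "the end"]  # hypothetical input
mapped = (pair for line in lines for pair in mapper(line))
word_counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
```

Because every value for a given key reaches the same reducer call, mappers and reducers can run on different machines without sharing state.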
 Time Series
 Recognize when time series analysis could be applied
 Define key time series concepts
 Determine the structure of a time series using graphical tools
 Compute a forecast using the Box-Jenkins methodology
 Evaluate models/forecasts using cross validation and statistical tests
 Engineer features to handle seasonal, calendar, and periodic components

Explain taxonomy of exponential smoothing using ETS framework
 Spark
 Configure a machine to use Spark effectively
 Describe differences and similarities between MapReduce and Spark
 Get data into Spark for processing.
 Describe lazy evaluation in the context of Spark.
 Cache RDDs effectively to improve performance.
 Use Spark to compute basic statistics
 Know the difference between Spark data types: RDD, DataFrame, DAG

Use MLlib
 SQL in Spark
 Identify what distinguishes a Spark DataFrame from an RDD
 Explain how to create a Spark DataFrame
 Query a DF with SQL
 Transform a DF with dataframe methods
 Describe the challenges and requirements of saving schema’d datasets.

Use user-defined functions
 Data Products
 Explain REST architecture/API
 Write a basic Flask API
 Describe web architecture at a high level
 Know the role of JavaScript in a web application
 Know how to use developer tools to inspect an application
 Write a basic Flask web application

Be able to describe the difference between online and offline computation
 Fraud Case Study
 Build an MVP (minimum viable product) quickly
 Build a dashboard
 Build system to take in online data from a stream

Build a production-quality product
 Whiteboarding
 Explain the meaning of Big-O.
 Analyze the runtime of code.
 Solve whiteboarding interview questions.

Apply different techniques to address a whiteboarding interview problem
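A typical whiteboarding exercise illustrates runtime analysis: deciding whether any two numbers in a list sum to a target. The problem and both function names are hypothetical examples, shown here to contrast a quadratic solution with a linear one.

```python
def has_pair_quadratic(nums, target):
    """O(n^2) time, O(1) extra space: check every pair."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return True
    return False


def has_pair_linear(nums, target):
    """O(n) expected time, O(n) extra space: remember values seen so far."""
    seen = set()
    for n in nums:
        if target - n in seen:
            return True
        seen.add(n)
    return False
```

The second version trades memory for time, a trade-off worth stating out loud in an interview.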
 Business Analytics
 Explain funnel metrics and applications
 Identify red flags in a set of funnel metrics
 Identify and discuss appropriate use cases for cohort analysis
 Identify and explain the limits of data analysis
 Given an open ended question, identify the business goal, metrics, and relevant data science solution.
 Identify excessive or improper use of data analysis
 Explain how data science is used in industry
 Understand the range of business problems where A/B testing applies