Data Science

Regression-Discontinuity Analysis

  • https://conjointly.com/kb/regression-discontinuity-analysis/

Deep Learning with Structured Data

  • https://www.manning.com/books/deep-learning-with-structured-data

Data Analysis

  • https://towardsdatascience.com/the-six-types-of-data-analysis-75517ba7ea61
  • https://digital.gov/2015/04/16/using-a-hypothesis-driven-approach-in-analyzing-and-making-sense-of-your-website-traffic-data/
  • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4552232/
  • https://towardsdev.com/exploratory-data-analysis-a-walkthrough-using-python-libraries-670ad3cf3659

Data Viz

  • https://extremepresentation.typepad.com/blog/2008/06/index.html

Intro to statistics

  • https://michael-bar.github.io/Introduction-to-statistics/

Transformers

  • https://e2eml.school/transformers.html
  • https://jalammar.github.io/illustrated-transformer/
  • https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
  • https://nlp.seas.harvard.edu/2018/04/03/attention.html
  • https://aman.ai/primers/ai/transformers/
  • https://nn.labml.ai/transformers/index.html

Hypothesis Testing

  • https://www.visual-design.net/post/an-interactive-guide-to-hypothesis-testing-in-python

Data Profiling

  • https://oralytics.com/2022/04/04/python-data-profiling-libraries/

Machine Learning for Retail Demand Forecasting

  • https://towardsdatascience.com/machine-learning-for-store-demand-forecasting-and-inventory-optimization-part-1-xgboost-vs-9952d8303b48

SQL Visualizer

  • https://sqlflow.gudusoft.com/#/

JSON Visualizer

  • https://jsoncrack.com/editor

Public APIs

https://github.com/public-apis/public-apis

30 books to learn advanced math

https://abakcus.com/30-best-math-books-to-learn-advanced-mathematics-for-self-learners/

Quantile Regression

https://towardsdatascience.com/quantile-regression-ff2343c4a03


Data Analysis and Visualization Course Notes

Udemy - Python Data Analysis & Visualization Masterclass - Colt Steele

Question Answer
Where does the name pandas come from? The econometrics term, “panel data”, which refers to multidimensional data sets.
What is pandas? Open-source data analysis and manipulation package for Python.
How do you provide column names while reading a dataset that doesn’t come with it? pd.read_csv(“filename.csv”, names=[“col1”, “col2”, …, “colN”])
What’s the output of df.info() ? df shape, column names, column data types, null count per column, memory usage
What’s a pandas Dataframe? 2D, size-mutable, potentially heterogeneous tablular data structure with labeled axes.
How do you generate descriptive statistics for a string columns? df.describe(include=[‘object’])
What are the return types of the following dataframe methods? info(), head()/tail(), describe(), mode(), sum() NoneType, DataFrame, DataFrame, DataFrame, Series
Can you give a few examples where the dot notation for selecting columns doesn’t work? When the column name has either a space or a dot in it, the dot notation doesn’t work. The dot notation doesn’t allow us to select multiple columns at the same time. The dot notation is also not helpful when we reference the column name through a variable. Another situation where the dot notation isn’t helpful is when the column name is the same as a dataframe method. Eg. a column with name “head”
What’s a pandas Series? class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False). One-dimensional ndarray with axis labels (including time series).
What are the properties of Labels in a pandas Series? Labels need not be unique but must be a hashable type.
What data types does the Series object support? The Series object supports both integer- and label-based indexing.
What kind of methods does a pandas Series provide? Series provides methods for performing operations involving the index.
What data structure is the pandas Series built on? Numpy array.
How do the statistical methods in a pandas Series deal with NaNs? Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
Why is it not necessary for pandas Series to be of same length for performing operations on them? Operations between Series (+, -, /, *, **) align values based on their associated index values. So, they need not be the same length.
What would be the result index of an operation on two Series objects? The result index will be the sorted union of the two indexes.
How does pandas Series handle different dtypes in the same series? When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype. Try: s = pd.Series([1, 2.0, "3"]), type(s[0]), type(s[1]), type(s[2])
How do nlargest and nsmallest work on DataFrames? df.nlargest(n, [“col1”, “col2”, …, “colN”]) - returns top n rows from dataframe sorted by col1, col2, …, colN
Analysis Operations Steps Methods Documentation Source Code
1. Read the dataset as a DataFrame pd.read_csv(“filename.csv”) https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/readers.py
2. Inspect the DataFrame df.info(), df.head(), df.tail(), df.dtypes https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html https://github.com/pandas-dev/pandas/blob/main/pandas/core/frame.py
3. Generate Descriptive Statistics df.describe(), df.sum(), df.mode(), df.describe(include=[‘object’]) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html https://github.com/pandas-dev/pandas/blob/main/pandas/core/describe.py
4. Work with individual columns col = df[“COLUMN NAME”], col.min(), col.max(), col.shape, col.values, col.index, col.head(), col.tail(), col.describe(), col.unique(dropna=True), col.nunique(), col.nlargest(n=5, keep=”first”), col.nsmallest(n=5, keep=”first”), col.value_counts(), col.plot(), col.dtype    

LinkedIn Posts

Daniel Lee • Following 🚀 Land Top Jobs with DataInterview.com (Ex-Google) 3h • Edited • 3 hours ago

While cleaning my closet, I found a stack of lecture notes I collected over the course of 8 years of studying data science.

These notes detail the theory, formulas and application of various subjects including matrix algebra, Neural networks, Bayesian statistics, experimental design and recommender systems.

Some of these pages have coffee stains from the early days of my career when I went to Starbucks right after work to learn about data science.

It’s a great reminder of why I love data science. I am glad that I. had a chance to study it as an undergraduate at Virginia Tech, apply it as a data scientist at Google, and teach it as an interview coach on DataInterview.com.

If you are wondering what resources I personally used to learn data science, here’s a detailed list:

✅ Machine Learning

  • Stanford CS224 (NLP)
  • Stanford CS229 (ML)
  • Stanford CS231 (Computer Vision)
  • Intro to Deep Learning by Lex Fridman
  • Deep Learning by Ian Goodfellow

✅ Statistics

  • Probability by Jim Pitman
  • Penn State Design of Experiments
  • Time Series Analysis & Forecasting by Douglas C. Montgomery

✅ Software Engineering

  • Harvard CS50 (Intro to Computer Science)
  • Effective Python by Brett Slatkin
  • Seven Databases in Seven Weeks
  • Stanford SQL & DB course
  • Automate the Boring Stuff with Python

✅ Specializations

  • Fraud Analytics by Bart Baesens
  • Practical Recommender Systems by Kim Falk

Other notable mentions include Khan Academy, Udx, Coursera and Udacity.

Fixed Effects / Random Effects / Mixed Models and Omitted Variable Bias

Source: https://www.statisticshowto.com/experimental-design/fixed-effects-random-mixed-omitted-variable-bias/