Machine Learning with sklearn

Reading

Browse this tutorial.

Key ideas

About machine learning

Supervised learning
Feature variables and target variable
Classification vs. regression
Linear regression
Decision trees

Sklearn

The first machine learning method we will look at is linear regression. See the sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
We will also look at a specific measure of error, mean squared log error. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html
We wil learn a second machine learning method: Decision Trees. https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

Data for today: train.csv and test.csv, both from this Kaggle competition.

Some commands we will run

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.columns

train['SalePrice'].head()

train['SalePrice'].describe()

plt.hist(train['SalePrice'], bins=25)

train.dtypes

train['Fireplaces'].describe()

plt.plot(train['FullBath'], train['SalePrice'], 'bo', alpha=0.3)


from sklearn.linear_model import LinearRegression

reg_object = LinearRegression()

X = np.array(train['OverallQual']).reshape(-1,1)
y = train['SalePrice']

reg_object.fit(X, y)
reg_object.predict(np.array(1).reshape(-1,1))
reg_object.score(X, y)

X_test = np.array(test['OverallQual']).reshape(-1,1)


#same metric as kaggle:
# note that we must not have negatives in predictions

predictions = reg_object.predict(X)
from sklearn.metrics import mean_squared_log_error
np.sqrt(mean_squared_log_error( y, predictions.clip(min=0)))

X = np.array(list(zip(train['OverallQual'], train['FullBath'])))

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split (X, Y, test_size = 0.20, random_state=42)

from sklearn.tree import DecisionTreeRegressor