Pattern Recognition and Machine Learning (Information Science and Statistics)
The above book by Christopher M. Bishop is widely regarded as one of the most comprehensive books on Machine Learning. At over 700 pages, it covers most machine learning and pattern recognition topics.
It is considered very rigorous for a machine learning (data science) book, yet has a lighter touch than a pure mathematics or theoretical computer science book. Hence, it is well suited as a reference book, or even as a textbook for students self-learning the subject from the ground up (i.e. students who want to understand the material instead of just blindly applying algorithms).
A brief overview of the contents covered (taken from the contents page of the book):
- Linear Models for Regression
- Linear Models for Classification
- Sparse Kernel Machines
- Mixture Models and EM
- Continuous Latent Variables
In Python (pandas), saving a .csv file to a particular folder is not hard, but it can be confusing to beginners.
The packages we need to import are:
import pandas as pd
import os
Say your folder is called “myfolder”, and the dataframe you have is called “df”. To save it inside “myfolder” as “yourfilename.csv”, the following code does the job:
df.to_csv(os.path.join('myfolder', 'yourfilename.csv'))
The reason this may be difficult for beginners is that they may not know of the existence of the os.path.join method, which is the recommended way of joining one or more path components.
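Putting it together, here is a minimal runnable sketch using the folder and file names from the text (the sample dataframe is a made-up stand-in for your own df, and the folder is created first in case it does not exist yet):

```python
import os
import pandas as pd

# Sample dataframe (a stand-in for your own df)
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Create the folder if it does not exist yet
os.makedirs("myfolder", exist_ok=True)

# os.path.join builds the path in an OS-independent way
df.to_csv(os.path.join("myfolder", "yourfilename.csv"), index=False)
```

The index=False argument omits the row index column from the .csv file; drop it if you want to keep the index.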
Besides accuracy, there are various other metrics in machine learning to measure how “accurate” a model is.
Some popular ones for binary classification are sensitivity (true positive rate) and specificity (true negative rate).
In computer science, recall and precision are also common metrics.
It can be quite confusing to remember offhand what each metric means, and how they are related.
To summarize, the following are equivalent (for binary classification):
sensitivity = recall of positive class
specificity = recall of negative class
Sample source: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cmdc.201700180
Besides the metrics mentioned above, there are many others, such as the F1 score.
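The two equivalences above can be checked numerically with a small made-up set of labels: sklearn's recall_score with pos_label selects which class counts as “positive”, so it can compute the recall of either class.

```python
from sklearn.metrics import recall_score

# Toy binary labels (1 = positive, 0 = negative); made-up for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Sensitivity = true positive rate = recall of the positive class
sensitivity = recall_score(y_true, y_pred, pos_label=1)

# Specificity = true negative rate = recall of the negative class
specificity = recall_score(y_true, y_pred, pos_label=0)

# The same quantities computed directly from the confusion counts
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

print(sensitivity, tp / (tp + fn))  # recall of positive class
print(specificity, tn / (tn + fp))  # recall of negative class
```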
This article is suitable for solving the following few problems:
- module ‘sklearn.tree’ has no attribute ‘plot_tree’
- pip install (on Spyder, Anaconda Prompt, etc.) does not install the latest package.
The leading cause of “module ‘sklearn.tree’ has no attribute ‘plot_tree’” is an outdated sklearn package.
Sometimes “pip install scikit-learn” simply does not update the sklearn package to the latest version. Run “import sklearn” followed by “print(sklearn.__version__)” to get the version of sklearn on your machine; it should be at least 0.21, the version in which plot_tree was added.
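The version check in full (assuming scikit-learn is installed at all):

```python
# plot_tree was added in scikit-learn 0.21, so the printed version
# should be at least 0.21
import sklearn

print(sklearn.__version__)
```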
The solution is to force pip to install the latest package:
pip install --no-cache-dir --upgrade <package>
In this case, we would replace <package> with “scikit-learn”.
Sometimes, pip install does not work in the Spyder IPython console; it displays an error to the effect that you should install “outside the IPython console”. This should not happen in a healthy installation, but as a quick fix you can run “pip install” in the Anaconda Prompt instead. It is likely that something went wrong during the installation of Anaconda or Python, and the long-term solution is to reinstall Anaconda.
In the R language, you often have to convert variables to “factor” (categorical) type. There is a known issue in the ‘caret’ library that may cause errors when you do this in certain ways.
The correct way to convert variables to ‘factor’ is:
trainset$Churn = as.factor(trainset$Churn)
In particular, “the train() function in caret does not handle factor variables well” when the conversion is done using other methods.
Basically, if you use other ways to convert to ‘factor’, the code may still run, but there may be some ‘weird’ issues that lead to inaccurate predictions (for instance if you are doing logistic regression, decision trees, etc.).
The Scikit-Learn (sklearn) Python package has a nice function, sklearn.tree.plot_tree, for plotting (decision) trees. It is described in the official scikit-learn documentation.
However, the default plot produced by the command
tree.plot_tree(clf)
could be low resolution if you try to save it from an IDE like Spyder.
The solution is to first import matplotlib.pyplot:
import matplotlib.pyplot as plt
Then, the following code will allow you to save the sklearn tree as .eps (or you could change the format accordingly):
plt.savefig('tree.eps',format='eps',bbox_inches = "tight")
To elaborate, clf is your Decision Tree classifier (it must be defined and fitted before plotting the tree):
# Example from https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html
from sklearn import tree
from sklearn.datasets import load_iris
iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(iris.data, iris.target)
The outcome is a vector graphics (.eps) tree that retains its full resolution when zoomed in. The bbox_inches="tight" argument prevents truncation of the image; without it, the sklearn tree can sometimes be cropped off and incomplete.
There are three main categories of Machine Learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
Here is a nice and interesting video on Unsupervised Learning:
A very good introduction to Machine Learning by Google. Google is the developer of TensorFlow (on which the Keras package is built). The other major platform for Machine Learning is PyTorch, by Facebook.
So far, the best introductory book on Machine Learning seems to be “Deep Learning with Python”, by the creator of Keras. See also Best Machine Learning / Deep Learning Books.