This is just to highlight that an existing installation of the Anaconda Python Distribution may stop working after upgrading to macOS Catalina. I only realized this when I tried to open Anaconda Navigator after installing Catalina.
The only (good) solution seems to be reinstalling Anaconda.
macOS Catalina was released on October 7, 2019, and has been causing quite a stir for Anaconda users. Apple has decided that Anaconda’s default install location in the root folder is not allowed. Catalina moves that folder into a folder on your desktop called “Relocated Items,” in the Security folder. If you’ve used the .pkg installer for Anaconda, this probably broke your Anaconda installation. Many users discuss the breakage at https://github.com/ContinuumIO/anaconda-issues/issues/10998.
Pattern Recognition and Machine Learning (Information Science and Statistics)
The above book by Christopher M. Bishop is widely regarded as one of the most comprehensive books on Machine Learning. At over 700 pages, it covers most machine learning and pattern recognition topics.
It is considered very rigorous for a machine learning (data science) book, yet it has a lighter touch than a pure mathematics or theoretical computer science book. Hence, it is well suited as a reference book or even a textbook for students self-learning the subject from the ground up (i.e. students who want to understand the algorithms instead of just applying them blindly).
A brief overview of the contents covered (taken from the contents page of the book):
Linear Models for Regression
Linear Models for Classification
Sparse Kernel Machines
Mixture Models and EM
Continuous Latent Variables
Most people have heard of the chi-squared test, but not many know that there are (at least) two types of chi-squared tests.
The two most common chi-squared tests are:
- 1-way classification: Goodness-of-fit test
- 2-way classification: Contingency test
The goodness-of-fit chi-squared test is used to test proportions or, to be precise, to test whether an observed distribution fits an expected distribution.
The contingency test (the more classical type of chi-squared test) is used to test the independence or relatedness of two random variables.
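As a rough illustration of the difference between the two tests, the sketch below computes the chi-squared statistic and degrees of freedom for each in plain Python. The counts are made up for illustration, the function names are hypothetical, and no continuity correction is applied. In practice a library routine such as scipy.stats.chisquare or scipy.stats.chi2_contingency would also return the p-value.

```python
def goodness_of_fit_statistic(observed, expected_probs):
    """1-way test: compare observed counts with expected proportions."""
    total = sum(observed)
    expected = [p * total for p in expected_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1
    return stat, dof


def contingency_statistic(table):
    """2-way test: compare a table of counts with what independence predicts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up data: 60 observations in 3 categories, tested against equal proportions
gof_stat, gof_dof = goodness_of_fit_statistic([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])

# Made-up data: a 2x2 table of counts for two categorical variables
ct_stat, ct_dof = contingency_statistic([[10, 20], [20, 10]])

print(gof_stat, gof_dof)
print(ct_stat, ct_dof)
```

The goodness-of-fit function takes a single row of counts plus the proportions being tested, while the contingency function takes a full table and derives the expected counts from the row and column totals.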
The best website I found on how to code the two chi-squared tests in practice (in R) is: https://web.stanford.edu/class/psych252/cheatsheets/chisquare.html
I created a PDF copy of the above site, in case it becomes unavailable in the future:
Chi-squared Stanford PDF
Best Videos on each type of Chi-squared test
Goodness of fit Chi-squared test video by Khan Academy:
Contingency table chi-square test:
Most of the time, users of R and Python will rely on packages and libraries as far as possible, in order to avoid “reinventing the wheel”. Established packages are also often superior and preferred, due to a lower chance of errors and bugs.
We list the most popular and useful packages in R and Python for data science, statistics, and machine learning.
Packages in R
This article is suitable for solving the following few problems:
- module ‘sklearn.tree’ has no attribute ‘plot_tree’
- pip install (on Spyder, Anaconda Prompt, etc.) does not install the latest package.
The leading reason for “module ‘sklearn.tree’ has no attribute ‘plot_tree’” is that the sklearn package is outdated.
Sometimes “pip install scikit-learn” simply does not update the sklearn package to the latest version. Type “import sklearn” followed by “print(sklearn.__version__)” to get the version of sklearn on your machine; it should be at least 0.21.
The solution is to force pip to install the latest package:
pip install --no-cache-dir --upgrade <package>
In this case, we would replace <package> by “scikit-learn”.
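To confirm that the upgrade took effect, the installed version can also be checked programmatically. The sketch below is one possible way to do it, assuming Python 3.8 or later (for importlib.metadata); the helper names parse_major_minor and meets_minimum are hypothetical, and the 0.21 threshold is the one mentioned above.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(version_string):
    """Return (major, minor) as integers from a string such as '0.21.3'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(package, minimum=(0, 21)):
    """True if the package is installed and at least the minimum version."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        # Package is not installed in this environment at all
        return False
    return parse_major_minor(installed) >= minimum


# "scikit-learn" is the distribution name used by pip for the sklearn module
print(meets_minimum("scikit-learn"))
```

If this prints False after upgrading, pip has probably installed into a different environment from the one Spyder is using.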
Sometimes pip install does not work in the Spyder IPython console and instead displays an error to the effect that you should install “outside the IPython console”. This is not normal (i.e. it should not happen), but as a quick fix you can try “pip install” in Anaconda Prompt instead. It is likely that something went wrong during the installation of Anaconda or Python, and the long-term solution is to reinstall Anaconda.
The Scikit-Learn (sklearn) Python package has a nice function sklearn.tree.plot_tree to plot (decision) trees. The documentation is found here.
However, the default plot produced by just using the command
tree.plot_tree(clf)
could be low resolution if you try to save it from an IDE like Spyder.
The solution is to first import matplotlib.pyplot:
import matplotlib.pyplot as plt
Then, the following code will allow you to save the sklearn tree as .eps (or you could change the format accordingly):
plt.savefig('tree.eps', format='eps', bbox_inches='tight')
To elaborate, clf is your Decision Tree classifier (to be defined before plotting the tree):
# Example from https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html
from sklearn import tree
from sklearn.datasets import load_iris
iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(iris.data, iris.target)
The outcome is a tree in a vector graphics format (.eps) that retains its full resolution when zoomed in. The bbox_inches="tight" argument prevents the image from being truncated. Without it, the sklearn tree is sometimes cropped and incomplete.
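Putting the pieces above together, a minimal end-to-end sketch might look like the following. It assumes scikit-learn 0.21 or later and matplotlib are installed; the Agg backend line and the figure size are additions made here so that the script also runs without a display, and they are not part of the original snippets.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed

import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

# Fit the decision tree on the iris data, as in the scikit-learn example
iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(iris.data, iris.target)

# Draw the tree on an explicit figure, then save it as vector graphics
fig, ax = plt.subplots(figsize=(12, 8))  # figure size is an arbitrary choice
tree.plot_tree(clf, ax=ax)
fig.savefig("tree.eps", format="eps", bbox_inches="tight")
plt.close(fig)
```

Changing the file extension and the format argument together (for example to "pdf" or "svg") should give other vector formats in the same way.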
While this result is nice, it also seems to mean that theoretically, we have already reached the limit in dimensional reduction for data compression.
Source: Science Daily
Harvard computer scientist demonstrates 30-year-old theorem still best to reduce data and speed up algorithms
- October 19, 2017
- Harvard John A. Paulson School of Engineering and Applied Sciences
- Computer scientists have found that the Johnson-Lindenstrauss lemma, a 30-year-old theorem, is the best approach to pre-process large data into a manageably low dimension for algorithmic processing.
When we think about digital information, we often think about size. A daily email newsletter, for example, may be 75 to 100 kilobytes in size. But data also has dimensions, based on the numbers of variables in a piece of data. An email, for example, can be viewed as a high-dimensional vector where there’s one coordinate for each word in the dictionary and the value in that coordinate is the number of times that word is used in the email. So, a 75 Kb email that is 1,000 words long would result in a vector in the millions.
This geometric view on data is useful in some applications, such as learning spam classifiers, but, the more dimensions, the longer it can take for an algorithm to run, and the more memory the algorithm uses.
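The word-count view of an email described above can be sketched in a few lines of Python. The five-word vocabulary and the sample text are toy assumptions made here for illustration; a real dictionary would have far more entries, which is where the high dimensionality comes from.

```python
from collections import Counter


def word_count_vector(text, vocabulary):
    """One coordinate per vocabulary word, holding that word's count in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["big", "data", "email", "lemma", "spam"]  # toy dictionary
email = "Big data makes big email archives"

vector = word_count_vector(email, vocabulary)
print(vector)  # [2, 1, 1, 0, 0]
```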
As data processing got more and more complex in the mid-to-late 1990s, computer scientists turned to pure mathematics to help speed up the algorithmic processing of data. In particular, researchers found a solution in a theorem proved in the 1980s by mathematicians William B. Johnson and Joram Lindenstrauss working in the area of functional analysis.
Known as the Johnson-Lindenstrauss lemma (JL lemma), computer scientists have used the theorem to reduce the dimensionality of data and help speed up all types of algorithms across many different fields, from streaming and search algorithms, to fast approximation algorithms for statistical and linear algebra and even algorithms for computational biology.
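The article does not say how the reduction is carried out. One standard construction associated with the JL lemma is a random Gaussian projection, sketched below in plain Python under that assumption. The dimensions (1,000 down to 200), the random points, and the function names are illustrative choices, not taken from the article.

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """k x d matrix with independent Gaussian entries of variance 1/k."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional point to k dimensions via a matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


rng = random.Random(0)  # fixed seed so the run is repeatable
d, k = 1000, 200

# Four random points in the original high-dimensional space
points = [[rng.random() for _ in range(d)] for _ in range(4)]

matrix = random_projection_matrix(k, d, rng)
projected = [project(matrix, p) for p in points]

# Ratio of projected distance to original distance for every pair of points
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(len(points))
    for j in range(i + 1, len(points))
]
print(min(ratios), max(ratios))
```

With these settings the ratios should stay fairly close to 1, meaning pairwise distances are roughly preserved even though each point now has a fifth of the coordinates.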
Harvard John A. Paulson School of Engineering and Applied Sciences. “Making big data a little smaller: Harvard computer scientist demonstrates 30-year-old theorem still best to reduce data and speed up algorithms.” ScienceDaily. ScienceDaily, 19 October 2017. <www.sciencedaily.com/releases/2017/10/171019101026.htm>.