I just found out by trial and error that suppressing print statements in RStudio greatly speeds up R code.
In my case, code that was originally estimated to take around 40 hours to run, just ran in under an hour after I suppressed all the print statements in the for loops.
This is corroborated on other forums, for example the Stack Overflow question "R: Does the use of the print function inside a for loop slow down R?"
Basically, if your code prints too much output to the console, it will slow down RStudio and your R code as well. It may be due to all the output clogging up memory in RStudio. R is also single-threaded by default, so it can only use one CPU core at a time even if your computer has multiple cores, and any time spent printing directly delays your computation.
Hence, the tips are to:
- Reduce the number of print statements in the code manually.
- Set quiet=TRUE in all scan statements. Basically, the default behavior is that scan() will print a line, saying how many items have been read.
This is especially true with for loops, since the amount of printed output can easily run into the millions of lines and overwhelm RStudio.
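The same effect is easy to demonstrate outside R. As a hedged illustration in Python (not from the original benchmark), you can suppress the prints in a hot loop by redirecting stdout to an in-memory buffer:

```python
import io
import time
from contextlib import redirect_stdout

def loop_with_prints(n):
    """Sum 0..n-1, printing every iteration (the costly part)."""
    total = 0
    for i in range(n):
        print(i)
        total += i
    return total

# Redirecting stdout suppresses the console output entirely
start = time.perf_counter()
with redirect_stdout(io.StringIO()):
    result = loop_with_prints(100_000)
quiet_time = time.perf_counter() - start
```

Printing to a real console is typically far slower than writing to an in-memory buffer, so the gap against an unsuppressed run is usually much larger than this sketch suggests.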
When the Mac (or MacBook) is running for a long time, it is very liable to do one of the following things:
- screen saver
- lock screen
The problem is that your Python program or R program running in the background will most likely stop completely. Sure, it can resume when you activate the Mac again, but that is not what most people want! For one, it may impact the accurate calculation of elapsed time of your Python code.
Changing settings via System Preferences -> Energy Saver is a possible solution, but it is troublesome and problematic:
- Have to switch it on and off again when not in use (many steps).
- Preventing sleep may still run into screen saver, screen lock, etc.
- Vice versa, preventing screen lock may still run into Mac sleeping, etc.
The solution is to install a free app called Amphetamine. Despite its "drug" name, it is a totally legitimate program with high reviews everywhere. What this app does is keep your Mac awake, preventing the screen saver, screen lock, and sleep. Hence, whatever program you are running will not halt until it is done (or until you switch off Amphetamine).
It is a great program that does its job well! Highly recommended for anyone doing programming, video editing or downloading large files on Mac.
There are tons of ways to calculate elapsed time (in seconds) for Python code. But which is the best way?
So far, I find that the “timeit” method seems to give good results, and is easy to implement. Source: https://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python
Use timeit.default_timer instead of timeit.timeit. The former automatically provides the best clock available on your platform and version of Python:
from timeit import default_timer as timer
start = timer()
# ... the code you want to time ...
end = timer()
print(end - start)  # elapsed time in seconds, e.g. 5.38091952400282
This is the answer by the user “jfs” on Stack Overflow.
Benefits of the above method include:
- Using timeit will produce far more accurate results since it will automatically account for things like garbage collection and OS differences (comment by user “lkgarrison”)
Please comment below if you know other ways of measuring elapsed time on Python!
Other methods include:
- time.clock() (deprecated as of Python 3.3, and removed in Python 3.8)
- time.time() (wall-clock time; fine for rough measurements, but it can jump if the system clock is adjusted)
- time.perf_counter() for system-wide timing,
- or time.process_time() for process-wide (CPU) timing
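For instance, a minimal self-contained sketch using time.perf_counter() (the summation is just placeholder work):

```python
import time

start = time.perf_counter()
total = sum(range(1_000_000))  # placeholder for the code being timed
elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.6f} seconds")
```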
This is just to highlight that the Anaconda Python Distribution does not work out of the box with macOS Catalina. I only realized this upon trying to open Anaconda Navigator after installing Catalina.
The only (good) solution seems to be reinstalling Anaconda.
macOS Catalina was released on October 7, 2019, and has been causing quite a stir for Anaconda users. Apple has decided that Anaconda's default install location in the root folder is not allowed. It moves that folder into a folder on your desktop called "Relocated Items," in the Security folder. If you've used the .pkg installer for Anaconda, this probably broke your Anaconda installation. Many users discuss the breakage at https://github.com/ContinuumIO/anaconda-issues/issues/10998.
In Python (pandas), saving a .csv file to a particular folder is not hard, but it can be confusing to beginners.
The packages we need to import are:
import pandas as pd
import os
Say your folder is called "myfolder", and the dataframe you have is called "df". To save it inside "myfolder" as "yourfilename.csv", the following code does the job:
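A minimal sketch (folder and file names as above; the os.makedirs call is an addition here so the code also works when the folder does not exist yet):

```python
import os
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # toy dataframe for illustration
folder = "myfolder"
os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
path = os.path.join(folder, "yourfilename.csv")
df.to_csv(path, index=False)
```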
The reason this may be difficult for beginners is that beginners may not know of the existence of the os.path.join method, which is the recommended method for joining one or more path components.
Most of the time, users of R and Python will rely on packages and libraries as far as possible, in order to avoid “reinventing the wheel”. Packages that are established are also often superior and preferred, due to lower chance of errors and bugs.
We list down the most popular and useful packages in R and Python for data science, statistics, and machine learning.
Packages in R
This article is suitable for solving the following few problems:
- module ‘sklearn.tree’ has no attribute ‘plot_tree’
- pip install (on Spyder, Anaconda Prompt, etc.) does not install the latest package.
The leading reason for "module 'sklearn.tree' has no attribute 'plot_tree'" is that the sklearn package is outdated.
Sometimes "pip install scikit-learn" simply does not update the sklearn package to the latest version. Type "print(sklearn.__version__)" (after import sklearn) to get the version of sklearn on your machine; it should be at least 0.21, the release in which plot_tree was added.
The solution is to force pip to install the latest package:
pip install --no-cache-dir --upgrade <package>
In this case, we would replace <package> by “scikit-learn”.
Sometimes, pip install does not work in the Spyder IPython console; it displays an error to the effect that you should install "outside the IPython console". This is not normal (i.e. it should not happen), but as a quick fix you can try pip install in Anaconda Prompt instead. It is likely that something went wrong during the installation of Anaconda/Python, and the long-term solution is to reinstall Anaconda.
In the R language, often you have to convert variables to “factor” or “categorical”. There is a known issue in the ‘caret’ library that may cause errors when you do that in a certain way.
The correct way to convert variables to ‘factor’ is:
trainset$Churn = as.factor(trainset$Churn)
In particular, “the train() function in caret does not handle factor variables well” when you convert to factors using other methods.
Basically, if you use other ways to convert to 'factor', the code may still run, but there may be some 'weird' issues that lead to inaccurate predictions (for instance if you are doing logistic regression, decision trees, etc.).
The Scikit-Learn (sklearn) Python package has a nice function sklearn.tree.plot_tree to plot (decision) trees; see the official scikit-learn documentation.
However, the default plot produced just by calling
tree.plot_tree(clf)
could be low resolution if you try to save it from an IDE like Spyder.
The solution is to first import matplotlib.pyplot:
import matplotlib.pyplot as plt
Then, the following code will allow you to save the sklearn tree as .eps (or you could change the format accordingly):
plt.savefig('tree.eps', format='eps', bbox_inches="tight")
To elaborate, clf is your Decision Tree classifier (to be defined before plotting the tree):
# Example from https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html
from sklearn import tree
from sklearn.datasets import load_iris
iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(iris.data, iris.target)
The outcome is a vector graphics (.eps) tree that retains its full resolution when zoomed in. The bbox_inches="tight" argument prevents truncation of the image; without it, the sklearn tree may be cropped off and incomplete.
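Putting the pieces together, a self-contained sketch (the Agg backend is selected here so the script also runs without a display; swap .eps for .png etc. as needed):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no window needed
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
tree.plot_tree(clf)
plt.savefig("tree.eps", format="eps", bbox_inches="tight")
```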
R has the package "psych", which allows one to calculate Cronbach's alpha with a single call to its alpha() function.
For Python, the situation is trickier, since there does not seem to be a dedicated package for calculating Cronbach's alpha. Fortunately, the formula is not very complicated, and it can be computed in a few lines.
Existing code can be found on Stack Overflow, but it has some small bugs. A corrected version is:
import numpy as np

def cronbach_alpha(itemscores):  # function wrapper added here for reuse
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=0, ddof=1)
    tscores = itemscores.sum(axis=1)
    nitems = itemscores.shape[1]  # number of items (columns); shape alone was the bug
    return (nitems / (nitems - 1)) * (1 - itemvars.sum() / tscores.var(ddof=1))
The input “itemscores” can be your Pandas DataFrame or any numpy array. (Note that this method requires you to “import numpy as np”).
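As a quick sanity check (self-contained, wrapping the formula above in a function; the name cronbach_alpha is chosen here for illustration), two perfectly correlated items should give an alpha of exactly 1:

```python
import numpy as np

def cronbach_alpha(itemscores):
    # rows = respondents, columns = items
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=0, ddof=1)
    tscores = itemscores.sum(axis=1)
    nitems = itemscores.shape[1]
    return (nitems / (nitems - 1)) * (1 - itemvars.sum() / tscores.var(ddof=1))

alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])  # perfectly correlated items
```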
The R programming language has an excellent package “psych” that Python has no real equivalent of.
For example, R can do the following code using the principal() function:
principal(r=dat, nfactors=num_pcs, rotate="varimax")
to return the “rotation matrix” in principal component analysis based on the data “dat” and the number of principal components “num_pcs”, using the “varimax” method.
The closest equivalent in Python is to first use the factor_analyzer package:
from factor_analyzer import FactorAnalyzer
Then, we use the following code to get the “rotation matrix”:
fa = FactorAnalyzer(n_factors=3, method='principal', rotation="varimax")
fa.fit(dat)  # after fitting, the rotated loadings are in fa.loadings_
Step 1 is to create a Bash file (using any editor, even Notepad). Sample code:
#!/bin/bash
python testing.py &
python testingb.py &
The above code will run the two Python files "testing.py" and "testingb.py" simultaneously; add more python scripts if needed. The first line, #!/bin/bash, is called the "shebang" and tells the computer to run the file with Bash (there are various versions, but according to Stack Overflow this one works best).
The above bash file can be saved to any name and any extension, say “bashfile.txt”.
Step 2 is to login to Terminal (Mac) or Putty (Windows).
chmod +x bashfile.txt
This will make the “bashfile.txt” executable.
Follow up by typing:
nohup ./bashfile.txt &
This will run "bashfile.txt" and its contents. The output will be put into a file called "nohup.out". The nohup command is preferred for very long scripts, since the job will keep running even if the Terminal closes (due to a broken connection or computer problems).
If your child is interested in a Computer Science/Data Science career in the future, do consider learning Python beforehand. Computer Science is getting very popular in Singapore again. To see how popular it is, just check out the latest cut-off point for NUS Computer Science: it is close to a perfect score (AAA/B) for A-levels.
According to many sources, the Singapore job market (including government sector) is very interested in skills like Machine Learning/ Deep Learning/Data Science. It seems that Machine Learning can be used to do almost anything and everything, from playing chess to data analytics. Majors such as accountancy and even law are in danger of being replaced by Machine Learning. Python is the key language for such applications.
I just completed a short course on Python: Python A-Z™: Python For Data Science With Real Exercises! The course fee is payable via Skillsfuture for Singaporeans, i.e. you don’t have to pay a single cent. (You have to purchase it first, then get a reimbursement from Skillsfuture.) At the end, you will get a Udemy certificate which you can put in your LinkedIn profile.
The course includes many things from the basic syntax to advanced visualization of data. It teaches at quite a basic level, I am sure most JC students (or even talented secondary students) with some very basic programming background can understand it.
The best programming language for data science is currently Python. Try not to start with "old" languages like C++, as they may become obsolete for this field. In any case, the focus is on programming structure, which is more or less universal across different languages.
Udemy URL: Python A-Z™: Python For Data Science With Real Exercises!
Related posts on Python:
Students trying to import the package “matplotlib” on PyCharm will soon face the cryptic error message: “Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework.”
It is puzzling what to do. Here are the steps you can follow to solve it in about 3 minutes:
First you need to install the “matplotlib” on PyCharm following the instructions here: https://stackoverflow.com/questions/21883768/pycharm-and-external-libraries
Then, to import matplotlib, you need the following lines:
import matplotlib as mpl
mpl.use('TkAgg')  # assumption: selecting the TkAgg backend, the usual fix for the "framework" error
import matplotlib.pyplot as plt
Done! You are ready to use matplotlib in the PyCharm interpreter through the plt alias.
Recently, I have been thinking of learning the Python language for math programming.
An advantage of using Python for math programming (e.g. testing hypotheses about numbers) is that Python has no fixed largest integer: it can handle integers as large as your computer's memory allows. (Read more at: http://userpages.umbc.edu/~rcampbel/Computers/Python/numbthy.html)
Other programming languages, for example Java, have a maximum integer value beyond which the program starts to fail. A Java int has a maximum value of 2,147,483,647 (2^31 - 1), and even a long tops out at 9,223,372,036,854,775,807 (2^63 - 1), which is pretty limited if you are programming with large numbers (for example over a trillion). For instance, the seventh Fermat number is already 18446744073709551617. I was using Java personally until I recently needed to handle larger integers to test a hypothesis.
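A quick sketch of the point: in Python, exact integer arithmetic simply works at any size, with no overflow.

```python
# The Fermat number F_6 = 2**(2**6) + 1 (the "seventh", counting from F_0)
f6 = 2 ** 64 + 1
print(f6)              # 18446744073709551617
print(f6 > 2**63 - 1)  # True: larger than Java's long can hold
```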
How to install Python (free):
Hope this is a good introduction for anyone interested in programming!
Learning Python, 5th Edition
Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages.
Complete with quizzes, exercises, and helpful illustrations, this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3— the latest releases in the 3.X and 2.X lines—plus all other releases in common use today. You’ll also learn some advanced language features that recently have become more common in Python code.
- Explore Python’s major built-in object types such as numbers, lists, and dictionaries
- Create and process objects with Python statements, and learn Python’s general syntax model
- Use functions to avoid code redundancy and package code for reuse
- Organize statements, functions, and other tools into larger components with modules
- Dive into classes: Python’s object-oriented programming tool for structuring code
- Write large programs with Python’s exception-handling model and development tools
- Learn advanced Python tools, including decorators, descriptors, metaclasses, and Unicode processing