I just found out by trial and error that the suppressing of print statements in RStudio greatly speeds up the R code.
In my case, code that was originally estimated to take around 40 hours to run, just ran in under an hour after I suppressed all the print statements in the for loops.
This is supported by reports elsewhere, for example the StackOverflow question: R: Does the use of the print function inside a for loop slow down R?
Basically, if your code prints too much output to the console, it will slow down RStudio and your R code as well. This may be because the accumulated output clogs up RStudio’s console buffer and memory. Note also that base R is single-threaded by default, so it can only use one CPU core at a time, even if your computer has multiple cores.
Hence, the tips are to:
- Reduce the number of print statements in the code manually.
- Set quiet=TRUE in all scan() calls. By default, scan() prints a line reporting how many items have been read.
This is especially relevant for for loops, where the amount of printed output can easily run into the millions of lines and overwhelm RStudio.
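The general effect is easy to reproduce outside R as well: per-iteration console output in a tight loop is expensive. A minimal Python sketch (the loop and function are made up for illustration) that silences the prints by redirecting stdout to an in-memory buffer:

```python
import io
from contextlib import redirect_stdout

def noisy_loop(n):
    """Sum 0..n-1, printing on every iteration (the slow part)."""
    total = 0
    for i in range(n):
        print(i)  # per-iteration console output
        total += i
    return total

# Redirect stdout to a buffer so nothing hits the console
buf = io.StringIO()
with redirect_stdout(buf):
    result = noisy_loop(1000)

print(result)  # 499500
```

The computation is unchanged; only the destination of the output differs, which is exactly what suppressing print statements achieves in RStudio.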
Most people have heard of the chi-squared test, but not many know that there are (at least) two types of chi-squared tests.
The two most common chi-squared tests are:
- 1-way classification: Goodness-of-fit test
- 2-way classification: Contingency test
The goodness-of-fit chi-squared test is used to test proportions, or to be precise, to test whether an observed distribution fits an expected distribution.
The contingency test (the more classical type of chi-squared test) tests the independence (or relatedness) of two categorical variables.
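In Python, both tests are available in scipy.stats; a minimal sketch with made-up counts (chisquare for the 1-way test, chi2_contingency for the 2-way test):

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# 1-way (goodness-of-fit): do observed counts match expected proportions?
observed = np.array([30, 25, 45])                        # made-up counts
expected = np.array([0.30, 0.30, 0.40]) * observed.sum()
gof_stat, gof_p = chisquare(f_obs=observed, f_exp=expected)

# 2-way (contingency): are two categorical variables independent?
table = np.array([[20, 30],
                  [25, 25]])                             # made-up 2x2 table
ind_stat, ind_p, dof, expected_counts = chi2_contingency(table)
```

In each case, a small p-value is evidence against the null hypothesis (that the observed counts fit the expected distribution, or that the two variables are independent, respectively).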
The best website I found regarding how to practically code (in R) for the two chi-squared tests is: https://web.stanford.edu/class/psych252/cheatsheets/chisquare.html
I created a PDF copy of the above site, in case it becomes unavailable in the future:
Chi-squared Stanford PDF
Best Videos on each type of Chi-squared test
Goodness of fit Chi-squared test video by Khan Academy:
Contingency table chi-square test:
Most of the time, users of R and Python rely on packages and libraries as far as possible, in order to avoid “reinventing the wheel”. Established packages are also often preferred, since they are better tested and less likely to contain errors and bugs.
Below we list the most popular and useful packages in R and Python for data science, statistics, and machine learning.
Packages in R
In the R language, you often have to convert variables to “factor” (categorical) type. There is a known issue in the ‘caret’ library that can cause errors depending on how you perform this conversion.
The correct way to convert variables to ‘factor’ is:
trainset$Churn = as.factor(trainset$Churn)
In particular, “the train() function in caret does not handle factor variables well” when you convert to factors using other methods.
Basically, if you use other ways to convert to ‘factor’, the code may still run, but there may be some ‘weird’ issues that lead to inaccurate predictions (for instance, in logistic regression or decision trees).
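For comparison, the analogous conversion in Python uses pandas’ categorical dtype (the column name Churn is borrowed from the R example above; the data here is made up):

```python
import pandas as pd

# Toy DataFrame with the same column name as the R example
trainset = pd.DataFrame({"Churn": ["Yes", "No", "Yes", "No"]})

# Convert the column to pandas' categorical dtype
trainset["Churn"] = trainset["Churn"].astype("category")

print(trainset["Churn"].dtype)  # category
```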
R has the package “psych”, which allows one to calculate Cronbach’s alpha very easily in one line, using its alpha() function (e.g. psych::alpha(itemscores)).
For Python, the situation is trickier, since at the time of writing there does not seem to be a standard package for calculating Cronbach’s alpha. Fortunately, the formula is not very complicated and can be computed in a few lines.
An existing implementation can be found on StackOverflow, but it has some small “bugs” (for instance, nitems must be the number of columns, not the whole shape tuple). The corrected version is:
import numpy as np

def cronbach_alpha(itemscores):
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=0, ddof=1)  # variance of each item (column)
    tscores = itemscores.sum(axis=1)           # total score of each respondent (row)
    nitems = itemscores.shape[1]               # number of items, i.e. number of columns
    return (nitems / (nitems - 1)) * (1 - (itemvars.sum() / tscores.var(ddof=1)))
The input “itemscores” can be a pandas DataFrame or any NumPy array, with one row per respondent and one column per item. (Note that this method requires you to “import numpy as np”.)
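As a quick sanity check (the corrected function is reproduced here so the snippet runs on its own): with k identical items, internal consistency is perfect and the formula should return exactly 1.

```python
import numpy as np

def cronbach_alpha(itemscores):
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=0, ddof=1)
    tscores = itemscores.sum(axis=1)
    nitems = itemscores.shape[1]
    return (nitems / (nitems - 1)) * (1 - (itemvars.sum() / tscores.var(ddof=1)))

# Three identical items (made-up scores) -> perfect internal consistency
col = np.array([3, 4, 2, 5, 3])
scores = np.column_stack([col, col, col])
alpha = cronbach_alpha(scores)  # 1.0 (up to floating point)
```

This works because with k identical items the sum of item variances is k·v while the variance of the total score is k²·v, so the ratio is 1/k and alpha = (k/(k-1))·(1 - 1/k) = 1.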
The R programming language has an excellent package “psych” that Python has no real equivalent of.
For example, R can do the following code using the principal() function:
principal(r=dat, nfactors=num_pcs, rotate="varimax")
to return the “rotation matrix” in principal component analysis based on the data “dat” and the number of principal components “num_pcs”, using the “varimax” method.
The closest equivalent in Python is to first use the factor_analyzer package:
from factor_analyzer import FactorAnalyzer
Then, we use the following code to get the “rotation matrix”:
fa = FactorAnalyzer(n_factors=3, method='principal', rotation="varimax")