data – Mathtuition88

Python save csv to folder

In Python (pandas), saving a .csv file to a particular folder is not hard, but it can be confusing for beginners.

The packages we need to import are:

import pandas as pd
import os.path

Say your folder is called “myfolder” and your dataframe is called “df”. To save it inside “myfolder” as “yourfilename.csv”, the following code does the job:

df.to_csv(os.path.join('myfolder','yourfilename.csv'))

This may be difficult for beginners because they may not know about os.path.join, the recommended function for joining one or more path components.
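One thing to watch for: to_csv does not create the folder for you, so the call fails if “myfolder” does not exist yet. The following is a minimal sketch of one way to handle that, using only the standard library. The helper name csv_path_in_folder is made up for this illustration, and a temporary directory stands in for the real location.

```python
import os
import tempfile


def csv_path_in_folder(folder, filename):
    """Create `folder` if it is missing and return the path to `filename` inside it.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, filename)


# Try it inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    target = csv_path_in_folder(os.path.join(base, "myfolder"), "yourfilename.csv")
    folder_exists = os.path.isdir(os.path.dirname(target))
    # With a dataframe at hand, the save itself would be: df.to_csv(target)
```

In a real script you would pass your own folder name directly, e.g. df.to_csv(csv_path_in_folder('myfolder', 'yourfilename.csv')).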

Recall & Precision vs Sensitivity & Specificity

Besides the accuracy rate, machine learning uses various other metrics to measure how “accurate” a model is.

Some popular ones for binary classification are sensitivity (true positive rate) and specificity (true negative rate).

In computer science, recall and precision are also common metrics.

It can be quite confusing to remember offhand what each metric means, and how they are related.

To summarize, the following are equivalent (for binary classification):

sensitivity = recall of positive class

specificity = recall of negative class

Sample source: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cmdc.201700180
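To make the equivalences concrete, here is a small sketch in plain Python. The labels are made-up toy data (1 = positive, 0 = negative), and the function names recall and precision are defined here only for illustration.

```python
def recall(y_true, y_pred, label):
    """Of the samples whose true class is `label`, the fraction predicted as `label`."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions) / len(predictions)


def precision(y_true, y_pred, label):
    """Of the samples predicted as `label`, the fraction whose true class is `label`."""
    truths = [t for t, p in zip(y_true, y_pred) if p == label]
    return sum(t == label for t in truths) / len(truths)


# Toy data (assumed for illustration): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Confusion-matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

# The two ways of computing each quantity agree.
same_sensitivity = sensitivity == recall(y_true, y_pred, 1)
same_specificity = specificity == recall(y_true, y_pred, 0)
positive_precision = precision(y_true, y_pred, 1)
```

Note that precision is a different quantity: it divides by the number of predicted positives (TP + FP), not by the number of actual positives.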

Besides the metrics above, there are many others, such as the F1 score.

Advanced Excel Date Editing and Formatting

Suppose you have an Excel file with dates in the format DD/MM/YYYY.

So you have entries like 02/03/1999 (2 March 1999). Suppose you want to change it to 2/3/1999, i.e. remove leading zeroes from single-digit days and months.

It sounds easy, but it is actually quite tricky to do it in Excel without knowledge of the correct method. The normal methods of date formatting will simply not work.

After some searching, I found and tested that the following method is probably the fastest and easiest way to do it:

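Select the date column, then click Data > Text to Columns, click Next twice, and in the last step select Date with the order that matches your data (‘MDY’ in the original answer; ‘DMY’ for the DD/MM/YYYY entries above).

After that, you should be able to apply a custom number format without leading zeroes: d/m/yyyy for day-first dates, or m/d/yyyy for month-first dates (m/dd/yyyy if you want to keep the leading zero in the days).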

Source: StackOverflow
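If the same dates end up in Python rather than Excel, the conversion can be sketched with the standard library. This is only an illustration under the assumption that every entry is a DD/MM/YYYY string; the function name strip_leading_zeroes is made up here.

```python
from datetime import datetime


def strip_leading_zeroes(date_text):
    """Turn a DD/MM/YYYY string into D/M/YYYY (no zero padding on day or month)."""
    parsed = datetime.strptime(date_text, "%d/%m/%Y")  # raises ValueError on bad input
    return f"{parsed.day}/{parsed.month}/{parsed.year}"


short = strip_leading_zeroes("02/03/1999")  # "2/3/1999"
```

Parsing first, rather than deleting zeroes with string replacement, also rejects entries that are not valid dates.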

Making big data a little smaller

While this result is nice, it also seems to mean that, theoretically, we have already reached the limit of dimensionality reduction for data compression.

Source: Science Daily

Harvard computer scientist demonstrates 30-year-old theorem still best to reduce data and speed up algorithms

Date: October 19, 2017
Source: Harvard John A. Paulson School of Engineering and Applied Sciences
Summary: Computer scientists have found that the Johnson-Lindenstrauss lemma, a 30-year-old theorem, is the best approach to pre-process large data into a manageably low dimension for algorithmic processing.

When we think about digital information, we often think about size. A daily email newsletter, for example, may be 75 to 100 kilobytes in size. But data also has dimensions, based on the numbers of variables in a piece of data. An email, for example, can be viewed as a high-dimensional vector where there’s one coordinate for each word in the dictionary and the value in that coordinate is the number of times that word is used in the email. So, a 75 Kb email that is 1,000 words long would result in a vector in the millions.

This geometric view on data is useful in some applications, such as learning spam classifiers, but, the more dimensions, the longer it can take for an algorithm to run, and the more memory the algorithm uses.

As data processing got more and more complex in the mid-to-late 1990s, computer scientists turned to pure mathematics to help speed up the algorithmic processing of data. In particular, researchers found a solution in a theorem proved in the 1980s by mathematicians William B. Johnson and Joram Lindenstrauss, working in the area of functional analysis.

The theorem is known as the Johnson-Lindenstrauss lemma (JL lemma). Computer scientists have used it to reduce the dimensionality of data and help speed up all types of algorithms across many different fields, from streaming and search algorithms, to fast approximation algorithms for statistics and linear algebra, and even algorithms for computational biology.

Source:

Harvard John A. Paulson School of Engineering and Applied Sciences. “Making big data a little smaller: Harvard computer scientist demonstrates 30-year-old theorem still best to reduce data and speed up algorithms.” ScienceDaily. ScienceDaily, 19 October 2017. <www.sciencedaily.com/releases/2017/10/171019101026.htm>.
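The kind of dimensionality reduction the article describes can be sketched with a random Gaussian projection, one common construction associated with the JL lemma. Roughly, the lemma says that n points can be mapped into a dimension on the order of log(n)/ε² while changing every pairwise distance by at most a factor of 1 ± ε. The sizes below (d = 1000, k = 200, 3 points) and the random data are assumptions chosen only to keep the example fast; this is an illustration, not the construction analysed in the Harvard work.

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """k x d matrix with independent Gaussian entries of variance 1/k."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, vector):
    """Multiply the matrix by the vector (maps d dimensions down to k)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)  # fixed seed so the run is repeatable
d, k = 1000, 200        # original and reduced dimensions (illustrative values)
points = [[rng.random() for _ in range(d)] for _ in range(3)]

matrix = random_projection_matrix(k, d, rng)
projected = [project(matrix, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(len(points))
    for j in range(i + 1, len(points))
]
```

Each entry of ratios should come out close to 1, meaning the distances between points are roughly preserved even though each vector is five times shorter.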