The worlds of machine learning and statistics, while sharing some common mathematical vocabulary, are ultimately engaged in two different enterprises. The former seeks to leverage data to yield accurate predictions about previously unseen cases, but it is less concerned with understanding the causal process by which these outcomes are generated. Statistics, as it has come to be practiced in the 21st century, is instead interested in pinning down the causal story.

Most software developers understand the advantages of packaging up their code: it makes their functions testable, reliable, and reusable, not to mention making their future work much easier and more efficient. Here at Methods, we have realized the benefits of building R and Python packages to bundle up and test our code for collaboration both internally and with clients.
This post will help readers in the data science community see how easy it is to get started developing and testing their own Python package using the pytest framework and GitLab CI.

The second-annual rstudio::conf was held in San Diego at the end of January, bringing together a wide range of speakers, topics, and attendees. Covering all of it would require several people and a lot of space, but I’d like to highlight two broad topics that received a lot of coverage: new tools for Shiny and enhanced modeling capabilities for R.
Shiny
Several speakers introduced a collection of new tools for enhancing the capabilities of Shiny developers: asynchronous processing, simplified functional testing, and load testing are all coming to the Shiny world.

Logistic regression produces results that are typically interpreted in one of two ways:
- Predicted probabilities
- Odds ratios

Odds are the ratio of the probability that something happens to the probability that it doesn’t happen.
\[ \Omega(X) = \frac{p(y=1|X)}{1-p(y=1|X)} \]
An odds ratio is the ratio of two odds, each calculated at a different value of \(X\).
There are strengths and weaknesses to either choice.
Predicted probabilities are intuitive, but require assuming a value for every covariate.
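
To make the contrast concrete, here is a minimal Python sketch of both quantities for a logit model with a single predictor. The coefficient values (`INTERCEPT`, `SLOPE`) and the function names are hypothetical, chosen only for illustration; they do not come from any fitted model discussed here.

```python
import math

# Hypothetical coefficients for a one-predictor logit model (illustrative only).
INTERCEPT = -1.0
SLOPE = 0.5


def odds(p: float) -> float:
    """Odds for a probability p: p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return p / (1.0 - p)


def predicted_probability(x: float) -> float:
    """p(y=1 | x) under the hypothetical model above."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + SLOPE * x)))


def odds_ratio(x_from: float, x_to: float) -> float:
    """Ratio of the odds at x_to to the odds at x_from."""
    return odds(predicted_probability(x_to)) / odds(predicted_probability(x_from))


if __name__ == "__main__":
    # The odds ratio for a one-unit increase is the same wherever it starts,
    # while the change in predicted probability depends on the starting value.
    for start in (0.0, 4.0):
        change = predicted_probability(start + 1) - predicted_probability(start)
        print(f"x: {start} -> {start + 1}: "
              f"odds ratio = {odds_ratio(start, start + 1):.4f}, "
              f"change in probability = {change:.4f}")
```

In this sketch the odds ratio for a one-unit increase in `x` equals `exp(SLOPE)` regardless of the starting value, whereas the predicted probability has to be evaluated at a specific `x`.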

The purpose of this blog post is to review the derivation of the logit estimator and the interpretation of model estimates. Logit models are commonly used in statistics to test hypotheses related to binary outcomes, and the logistic classifier is commonly used as a pedagogical tool in machine learning courses as a jumping-off point for developing more sophisticated predictive models. A secondary goal is to clarify some of the terminology related to logistic models, which - as should already be clear given the interchangeable use of “logit” and “logistic” - may be confusing.