How to Hire a Data Scientist

Jeremy Albright

The terms Big Data, machine learning, and artificial intelligence emerged as key business buzzwords in the early 2010s and have not, a decade later, lost their allure or mystique. Companies now frequently advertise that they use data to drive decision-making or that they have developed tools to help other organizations do the same. Firms in competitive industries that do not yet have strong data science capacities suffer from a fear of missing out (FOMO) and may dash to hire people from the ever-growing pool of applicants claiming to be data scientists.

Source: https://i.redd.it/31zkb6zzn8f21.jpg

What happens all too often, however, is that expectations born of hype crash into several data analysis realities. An expensive data science team ends up delivering poor returns because of 1) naivete on the part of those doing the hiring, 2) poor data infrastructure planning, and 3) undefined deliverables. Worse, given the high demand for data science skills, talent can easily move to another firm when workers feel they are not making the “big difference” that all the fanfare has promised them. The intent of this post is to highlight things that we have seen go wrong when organizations are too quick to build a data science team, and to suggest how to proceed in a thoughtful manner that can justify the investment.

The primary takeaway is that organizations need to think clearly about what they hope to achieve and hire accordingly. When budgets are tight, or when deliverables are difficult to define, it may make more sense to outsource to a data science services provider that can scale up or down as needs evolve. Such firms, like ours, offer a comprehensive suite of skills that is rarely found in a single person and that would otherwise require the large expense of building a full in-house team. We understand, however, that selecting a company with which to partner must also be done carefully. Many start-ups, freelancers, and academic consultants lack the efficiency that a dynamic company requires and expects from its own staff. For organizations that wish to keep talent in-house, an understanding of potential hiring mistakes and how to avoid them can go a long way towards maximizing ROI.

What Can Go Wrong

Data science is a broad term that encompasses several overlapping fields ranging from data visualization to statistical learning to artificial intelligence. This diversity of toolkits forms a major source of confusion when it comes to hiring (or contracting with) data scientists. The following is a list of errors that companies commonly make when building a team, potentially at great cost to budgets and reputations:

  • Hiring to Meet Undefined Goals. If a company posts a job description for a data scientist, the applicant pool will consist of some candidates who create dashboards, others with a background in statistics, still others who build machine learning pipelines, and others with experience administering databases. All of these talents fall within the broad category of data science, yet they involve very distinct skills. Consequently, it is not possible to hire a data scientist well without first having a clear set of deliverables in mind. If a company needs a machine learning engineer but hires a data visualization specialist, the results will be subpar and disappointing. A company may seek to avoid this mistake by covering all its bases and hiring a team with different specializations. However, this leads to the next common error.
  • Hiring the Wrong Team. It is common for larger companies making an initial investment in data science to throw a lot of money at hiring a comprehensive team that covers many different skill sets. Perhaps the most common mistake is to feel the need to employ individuals who boast of a background in artificial intelligence, as AI has been hyped to the point where it is often misconstrued to be a panacea. While AI has indeed been revolutionary in areas such as image recognition and language processing, the algorithms require enormous amounts of data to train. On more rudimentary tasks such as market forecasting or customer segmentation, where the data are tabular and often modest in size, deep learning frequently performs no better than simpler methods and can perform worse, because the models are more complex than the data can support. The resulting deep learning predictions may be of little use, when more prosaic forms of data analysis would have been not only appropriate but quite informative.
  • Under-Hiring a Team. On the other hand, given the costs associated with building a whole team, an organization may end up under-hiring. This often manifests in two ways.
    • An organization may expect a single person to do too much. Statisticians are great at isolating causal factors driving business outcomes, but they are typically not trained to work as data engineers or to build web applications. Data scientists each come with their own idiosyncratic toolkit and may only apply the methodology that is easiest for them based on their narrow training. In certain cases, as noted above, the data scientist will apply a methodology that is too complex for the data to support. It is also possible that the new hire will apply a methodology that is too simple, such as fitting linear models to nonlinear processes. It is common for hires, especially those straight out of undergraduate or graduate school, to know only a small set of modeling strategies. If you want to hire a Goldilocks who can get the level of complexity just right, it is essential that you verify their toolkit.
    • In an attempt to avoid committing to a full-time employee, a company may choose to outsource its data analytics tasks to a freelancer, often one with an academic affiliation or hired through a freelancer service. In our experience, however, freelancers in general, and academics in particular, tend to be minimally invested in the hiring organization and to feel little urgency about completing projects. The result is deliverables that stretch deadlines and increase stress inside the organization.
xkcd curve fitting comic
Source: https://xkcd.com/2048/
  • Poor Data Infrastructure. Another common problem is that companies have data, but the information has been stored in poorly designed, or even outdated, information systems. The most time-consuming task of any data science project is often simply getting the data into a format that can be used for visualization or modeling. Databases that do not share common keys are difficult to join together, minimizing their usefulness, and dated infrastructure makes it difficult to do even basic tasks. Before hiring, a company needs to confirm that its data are usable in the first place, and if necessary work with a data engineer to ensure that they are.
  • Hires that Fail to See the Big Picture. The final hiring challenge is related more to data art than data science. Facility with code and with complicated, highly nonlinear algorithms is certainly a skill to appreciate, but its full potential cannot be realized if the analyst fails to understand the underlying problem the analysis seeks to address, or cannot clearly communicate to stakeholders how the work resolves the big questions. The best data scientists can restate the assigned task in their own words, determine what information in the data adequately measures the key variables, and explain, without code or equations, what model outputs mean for the organization. Although this may seem obvious, the most difficult part of trying to hire a data scientist is finding one who can see the big picture and integrate the analytics into a cohesive, actionable narrative for stakeholders.
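
The point made in the list above about applying linear models to nonlinear processes can be made concrete with a small sketch. The data here are synthetic, and the helper names (fit_line, r_squared) are hypothetical, written only for this illustration using the Python standard library. It fits the same simple least-squares routine twice to a U-shaped relationship: once on the raw input and once on its square.

```python
# Synthetic illustration only: the data and helper names are hypothetical.

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, preds):
    """Share of the variance in ys that is explained by preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# A U-shaped relationship: the outcome depends on the square of the input.
xs = [x / 2 for x in range(-10, 11)]
ys = [x ** 2 for x in xs]

# Model 1: a straight line in x.
slope, intercept = fit_line(xs, ys)
r2_linear = r_squared(ys, [slope * x + intercept for x in xs])

# Model 2: the same fitting routine, applied to the squared input.
zs = [x ** 2 for x in xs]
slope_z, intercept_z = fit_line(zs, ys)
r2_squared_term = r_squared(ys, [slope_z * z + intercept_z for z in zs])

print(f"R^2, straight line: {r2_linear:.3f}")
print(f"R^2, squared term:  {r2_squared_term:.3f}")
```

On this deliberately extreme example the straight line explains essentially none of the variation, while the model with a squared term explains essentially all of it. Real business data are noisier, but the same check, comparing a simple model against a slightly more flexible one, is a reasonable thing to ask a candidate to walk through.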

How to Proceed Prudently

After diagnosing potential pitfalls in hiring, a set of best recruitment practices becomes clear. First, it is essential to identify what a company wants to accomplish with its data scientists. There is a tendency to think that it is possible to point a data scientist at some data and, yadda yadda yadda, there are exciting new insights that improve operations and change the world. This is the worst way to proceed, yet it has been reinforced by FOMO. Instead, the specific needs of the organization must be clarified before embarking on the hiring journey.

Here are some sample questions that can help define expected deliverables.

  • Are there ways I can quickly view current information about core parts of my business, such as sales, marketing, or ROI?
    • Can I link the relevant pieces of data together when needed?
    • Is a static dashboard sufficient to view the data?
    • Or do I need some type of model-based forecasting tool, updating regularly, to guide my decisions?
  • Am I storing all the data that may be useful in the future?
    • If so, will the data be easy to access?
    • If not, how can I best build a data collection process and a data warehouse?
  • What do I not know about my customers that I would like to know?
    • Do I have that information, or does it need to be collected?
    • If data collection is needed, can it be automated?
  • Are there inefficiencies in my business processes?
    • If I don’t know, how can I find them?
    • If I find them, how can I address them?
  • Are there other tasks, such as triaging customer inquiries, that can be automated using AI tools?

The next step after defining deliverables is to hire, but hiring and onboarding should be done carefully and deliberately. It is often advisable to start with a data engineer. This is not somebody who necessarily knows the most cutting-edge predictive algorithms (though it’s great if they do) but rather a person who understands database designs, how data tables link to each other, and how to pull exactly the information needed to pass on to the data analyst. Having somebody who is dedicated to knowing the organization’s information systems will make the work of any subsequent hires much more efficient as analytic needs increase.
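
As a rough illustration of what "understanding how data tables link to each other" involves, the sketch below checks whether two tables actually share key values before anyone tries to join them. Everything in it is an assumption made for the example: the table contents, the column name customer_id, and the helper key_overlap are all hypothetical, and a real system would run an equivalent check in SQL or a data-frame library.

```python
# Hypothetical example: tables are represented as lists of dictionaries.

def key_overlap(left_rows, right_rows, key):
    """Summarize how well two tables line up on a shared key column."""
    left_keys = {row[key] for row in left_rows}
    right_keys = {row[key] for row in right_rows}
    return {
        "matched": len(left_keys & right_keys),
        "left_only": sorted(left_keys - right_keys),
        "right_only": sorted(right_keys - left_keys),
    }


# Made-up customer and order tables that should link on customer_id.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
    {"customer_id": 3, "name": "Initech"},
]
orders = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 3, "total": 75.5},
    {"customer_id": 4, "total": 310.0},  # no matching customer record
]

report = key_overlap(customers, orders, "customer_id")
print(report)
```

Keys that appear in only one table (here, a customer with no orders and an order with no customer record) are the kind of gap that is cheap to find before analysis begins and expensive to discover afterwards.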

Once the data architecture is well managed, the next step is to hire data scientists to work with the data given the specific business goals. It is common during the interview process to assign a candidate a toy data problem to solve. While this is helpful for weeding out those who overstate their skills, the problem needs to reflect the actual work being done so that qualified candidates are not missed. Make sure the tasks are well defined, then build test problems to reflect them.

Interviews should also assess soft skills, especially communication. Asking for writing samples, either as part of the technical assessment or separate from it, can provide important clues about the candidate’s maturity and ability to speak with superiors and stakeholders. Another approach is to ask candidates to describe a specific problem they have solved previously. Can they, in clear English and without jargon, lay out the problem, indicate their chosen methodology, and explain how their work resolved the issue? It is important to recognize that hiring for a position that requires both hard and soft skills is difficult to automate, and hence it is advisable for those who will be working with the data scientists to be part of the entire hiring process. HR departments, when left to themselves, often have a poor track record of hiring individuals with technical skills.

Finally, after hiring a team it is important to provide and encourage opportunities for continuing education. Data science is a quickly evolving field: both algorithms and software tools change at a pace that can be difficult to keep up with. Attending professional conferences can be motivating for staff and help them continue to expand their toolkit. The more your staff know, the more likely they are to find the best data science solution to the specific problems they are given.

The Benefits of Data Science Outsourcing

For those who find building a data science team to be intimidating or prohibitively expensive, Methods Consultants provides an affordable, lower-risk alternative to taking on the payroll costs of full-time hires. We provide the benefits of a full team with different skills and well over a decade’s worth of experience in diverse methodologies that can address data science tasks of any size and complexity. The benefit to our clients is the ability to scale up or down as needs evolve. In this way we are able to provide customized solutions that draw on an extensive skill set without the usual overhead of hiring in-house. Our responsiveness and ability to communicate clearly to all stakeholders are particular strengths.

However your organization wishes to proceed in its data science journey, it is essential that decisions are not based simply on FOMO. Building analytics capacity within an organization is a process that must be undertaken with careful consideration in order to avoid a costly investment that fails to deliver. This can be an intimidating undertaking, but making sure it is done thoughtfully will maximize the probability of building successful data science capacities.