One of the main differences between a data scientist and a data engineer has to do with ETL versus DAD:
- ETL (Extract/Transform/Load) is for data engineers, or sometimes data architects or database administrators (DBA).
- DAD (Discover/Access/Distill) is for data scientists.
Data engineers tend to focus on software engineering, database design, production code, and making sure data flows smoothly between the source (where it is collected) and the destination (where it is extracted and processed, with statistical summaries and output produced by data science algorithms, and eventually moved back to the source or elsewhere). Data scientists, while they need to understand this data flow (and how it is optimized, especially when working with Hadoop), don't optimize the data flow itself, but rather the data processing step: extracting value from data. They do, however, work with engineers and business people to define metrics, design data collection schemes, and make sure data science processes integrate efficiently with the enterprise data systems (storage, data flow). This is especially true for data scientists working in small companies, and a reason why data scientists should be able to write code (more and more often in Python) that engineers can reuse.
Sometimes data engineers do DAD, and sometimes data scientists do ETL, but it’s not common, and when they do it’s usually internal. For example, the data engineer may do a bit of statistical analysis to optimize some database processes, or the data scientist may do a bit of database management to manage a small, local, private database of summarized information.
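To make the ETL side concrete, here is a minimal sketch of the three steps in Python, using an invented page-view dataset and an in-memory SQLite database as the destination; the data, table name, and schema are all hypothetical, and a production pipeline would read from real sources and handle errors far more carefully.

```python
import sqlite3
import csv
import io

# Hypothetical raw export: a CSV of page views, including one malformed row.
raw = io.StringIO(
    "user_id,page,duration_sec\n"
    "u1,home,12\n"
    "u2,pricing,not_available\n"
    "u1,pricing,30\n"
)

# Extract: parse the source records.
rows = list(csv.DictReader(raw))

# Transform: drop malformed durations and cast types.
clean = [
    {"user_id": r["user_id"], "page": r["page"],
     "duration_sec": int(r["duration_sec"])}
    for r in rows
    if r["duration_sec"].isdigit()
]

# Load: write the cleaned records to the destination store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views (user_id TEXT, page TEXT, duration_sec INTEGER)"
)
conn.executemany(
    "INSERT INTO page_views VALUES (:user_id, :page, :duration_sec)", clean
)
total = conn.execute("SELECT SUM(duration_sec) FROM page_views").fetchone()[0]
print(total)  # 42
```

The point of the sketch is the division of labor: the extract and load steps (connections, schemas, throughput) are the data engineer's territory, while the data scientist mostly consumes what lands in the destination table.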
DAD comprises three steps:
- Discover: Find and identify the sources of good data, and the relevant metrics. Sometimes request that the data be created (working with data engineers and business analysts).
- Access: Access the data, whether via an API, a web crawler, an Internet download, a database query, or in-memory processing within a database.
- Distill: Extract the essence from the data: the insights that lead to decisions, increased ROI, and actions (such as determining optimum bid prices in an automated bidding system). It involves:
- Exploring the data (creating a data dictionary and exploratory analysis)
- Cleaning (removing impurities)
- Refining (data summarization, sometimes multiple layers of summarization or hierarchical summarization)
- Analyzing: statistical analyses (sometimes including techniques such as experimental design, which can take place even before the Access stage), both automated and manual. This may or may not require statistical modeling
- Presenting results or integrating results in some automated process
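The Distill steps above can be sketched end to end in a few lines of Python. This is an illustrative toy only: the bid records, thresholds, and field names are all invented, and the access step is simulated with an in-memory list where a real project would hit an API or a database.

```python
from collections import defaultdict
from statistics import mean

# Access (simulated): in practice this might be an API call or a database
# query; here we stand in a small list of hypothetical bid records.
records = [
    {"keyword": "shoes", "bid": 0.40, "clicks": 10},
    {"keyword": "shoes", "bid": 0.55, "clicks": 14},
    {"keyword": "hats", "bid": 0.20, "clicks": None},  # impurity: missing clicks
    {"keyword": "hats", "bid": 0.25, "clicks": 6},
]

# Distill, step 1 -- cleaning: remove records with missing fields.
clean = [r for r in records if r["clicks"] is not None]

# Distill, step 2 -- refining: summarize per keyword.
by_keyword = defaultdict(list)
for r in clean:
    by_keyword[r["keyword"]].append(r)

summary = {
    k: {"avg_bid": mean(r["bid"] for r in rs),
        "total_clicks": sum(r["clicks"] for r in rs)}
    for k, rs in by_keyword.items()
}

# Distill, step 3 -- analyzing and acting: flag keywords whose click volume
# (under an arbitrary threshold of 10) may justify a higher bid.
raise_bid = [k for k, s in summary.items() if s["total_clicks"] >= 10]
print(summary["shoes"]["total_clicks"])  # 24
print(raise_bid)                         # ['shoes']
```

Note that each layer (raw records, cleaned records, per-keyword summary, action list) corresponds to one of the sub-steps listed above; in real projects the refining stage often has several such layers of hierarchical summarization.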
Data science is at the intersection of computer science, business engineering, statistics, data mining, machine learning, operations research, Six Sigma, automation, and domain expertise. It brings together a number of techniques, processes, and methodologies from different fields, together with business vision and action. Data science is about bridging the different components that contribute to business optimization at large, and eliminating the silos that slow down business efficiency.
Some employers are looking for Java or database developers with strong statistical knowledge. These professionals are very rare, so instead the employers sometimes try to hire a data scientist, hoping he or she is strong at developing production code. If you don't have that level of Java or database expertise, it can be a waste of time to attend these interviews. During your phone interview, you should ask upfront whether the position to be filled is a Java developer with statistics knowledge, or a statistician with strong Java skills. Sometimes the hiring manager is unsure what he really wants, and you might be able to convince him to hire someone like you by explaining the added value your expertise brings. It is easier for an employer to get a Java software engineer to learn statistics (especially using this book as training material) than the other way around.