Data Scrubbing

Having reviewed a sample dataset, we next need to talk about data scrubbing, which involves manipulating data in preparation for analysis. Some algorithms, for example, can’t process discrete variables, or they return an error message in response to missing values. Linear regression analyzes continuous variables, whereas gradient boosting requires that both discrete (categorical) and continuous variables be expressed numerically as an integer or floating-point number.
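As a rough illustration of how you might check a dataset for these two problems before modeling, the sketch below uses pandas to count missing values and to list columns that are not yet numeric. The DataFrame, its column names, and its values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy dataset; the column names and values are made up.
df = pd.DataFrame({
    "price": [250000.0, 310000.0, None, 415000.0],
    "rooms": [2, 3, 3, 4],
    "suburb": ["North", "South", "North", None],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# List the columns that are not numeric and would need to be
# re-expressed as numbers before most algorithms can use them.
non_numeric_columns = df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

How you then handle the missing values (dropping rows, filling with an average, and so on) depends on the dataset and the algorithm.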

Duplicate information, redundant variables, and errors in the data can also conspire to derail the model’s capacity to dispense valuable insight.
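Duplicate rows are one of the easier problems to detect. The following is a minimal sketch, assuming the data sits in a pandas DataFrame; the `customer_id` and `spend` columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy dataset in which the second and third rows are identical.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "spend": [40.0, 55.5, 55.5, 12.0],
})

# duplicated() flags every row that repeats an earlier row in full.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```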

Another consideration, specifically when working with private data, is removing personal identifiers that could contravene relevant data privacy regulations or damage the trust of customers, users, and other stakeholders. This is less of a problem for publicly available datasets.

Specific examples of data scrubbing include:

- One-hot encoding (re-expressing discrete variables as numeric 0/1 columns)

- Removing variables

- Removing personal identifiers

- Merging variables

- Data reduction
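Several of the techniques listed above can be sketched in a few lines of pandas. This is an illustrative example only: the dataset, column names, and the decision to merge `rooms` and `bathrooms` into a single `total_rooms` variable are assumptions made for demonstration, and data reduction is not shown.

```python
import pandas as pd

# Hypothetical toy dataset; all names and values are made up.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "email": ["ana@example.com", "ben@example.com", "cy@example.com"],
    "suburb": ["North", "South", "North"],
    "rooms": [2, 3, 4],
    "bathrooms": [1, 1, 2],
})

# Removing personal identifiers: drop columns that identify individuals.
scrubbed = df.drop(columns=["name", "email"])

# One-hot encoding: replace the discrete "suburb" column with one
# 0/1 column per category.
scrubbed = pd.get_dummies(scrubbed, columns=["suburb"], dtype=int)

# Merging variables: combine two related columns into one (an assumed
# choice for this example), then remove the originals.
scrubbed["total_rooms"] = scrubbed["rooms"] + scrubbed["bathrooms"]
scrubbed = scrubbed.drop(columns=["rooms", "bathrooms"])

print(scrubbed)
```

After these steps every remaining column is numeric, which is the form most algorithms expect.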

Lastly, it’s worth underlining that data scrubbing takes a lot of effort. As machine learning instructor and former Google employee Frank Kane explains, “the inconvenient truth is you spend less time analyzing your data and more time preparing and cleaning.”[1] Most people in the industry would echo those words.

^[1] Frank Kane, Machine Learning, Data Science and Deep Learning with Python, Sundog Education, Udemy.com.
