X & y Variables
Independent and Dependent Variables
Having defined machine learning, let’s now look at the basic building blocks of machine learning. This leads us to discussing data variables.
As with other fields of statistical inquiry, machine learning is built around the cross-analysis of dependent and independent variables. The main premise is to find how the independent variables (X) affect the dependent variable (y) and to use this information to make predictions.
The dependent variable (y) is the output you want to predict, and an independent variable (X) is an input that is presumed to affect that output. So, for example, the supply of oil (X) impacts the cost of fuel (y). The cost of fuel is the dependent variable, which is affected by a supply that is constantly changing. Other independent variables, including population growth and economic development, also affect the cost of fuel.
Another example would be analyzing how house attributes (distance to the city, suburb, number of rooms, land size, etc.) affect the selling price of homes in a particular neighborhood. A model can then be used to predict the price (y) of a house with an unknown selling price by inputting its features (X) into the prediction model.
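To make the X and y split concrete, here is a minimal sketch in plain Python. The column names (distance_km, rooms, land_size_m2, price) and all of the values are invented for illustration; they are not taken from a real dataset.

```python
# Hypothetical house records; names and values are invented for illustration.
houses = [
    {"distance_km": 5.0, "rooms": 3, "land_size_m2": 450, "price": 720_000},
    {"distance_km": 12.5, "rooms": 4, "land_size_m2": 600, "price": 650_000},
    {"distance_km": 2.0, "rooms": 2, "land_size_m2": 200, "price": 810_000},
]

target = "price"  # the dependent variable (y)
feature_names = [name for name in houses[0] if name != target]

# X holds one list of feature values per house; y holds the value to predict.
X = [[house[name] for name in feature_names] for house in houses]
y = [house[target] for house in houses]

print(feature_names)
print(X[0], y[0])
```

A prediction model would be trained on X and y, and then given the features of a house whose price is unknown.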
Variables in a Dataset
Next I am going to elaborate further on variables and what they look like in a typical dataset.
So for this example, we have a dataset with online user information used for ad targeting. The dataset consists of 10 variables as highlighted in blue, including Daily Time Spent on Site, Age, Area Income and so on. Variables are also often called features in machine learning and these two terms can be used interchangeably.
In this dataset we have rows of information documenting individual users. So in row one we have a user who is 35 years of age from Tunisia. Most datasets you will find have the variables displayed horizontally from left to right, with individual instances, such as product items or individual users, listed vertically below. This is generally because the number of items in a dataset outnumbers the number of features, and it’s easier to look at the dataset by scrolling down rather than scrolling across. Equally, though, there’s nothing stopping you from having a dataset with instances at the top and features listed vertically below.
Now in the next slide we have the independent variables highlighted in blue and the dependent variable of Clicked on Ad highlighted in green. In this case we are attempting to predict if a user will click on an ad based on 9 input variables, including Daily Time Spent on Site, Age, Area Income and so on.
It’s also important to note that no variable is exclusively dependent or independent, and we can switch this around based on the goals of our analysis.
In this updated example, the dependent variable in green is how much time a user will spend on the website, and we are predicting it based on the 9 independent variables in blue.
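Switching the dependent variable can be sketched as follows. The helper split_features_target is a hypothetical name, only four of the dataset's columns are shown, and apart from the 35-year-old user from Tunisia every value is invented for illustration.

```python
def split_features_target(rows, target):
    """Return (feature_names, X, y) for the chosen dependent variable."""
    feature_names = [name for name in rows[0] if name != target]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[target] for row in rows]
    return feature_names, X, y


# A reduced, partly invented sample of the ad-targeting dataset.
users = [
    {"Daily Time Spent on Site": 60.0, "Age": 35, "Country": "Tunisia", "Clicked on Ad": 0},
    {"Daily Time Spent on Site": 80.0, "Age": 31, "Country": "Italy", "Clicked on Ad": 0},
    {"Daily Time Spent on Site": 45.0, "Age": 48, "Country": "Peru", "Clicked on Ad": 1},
]

# First goal: predict whether a user clicks on the ad.
names_a, X_a, y_a = split_features_target(users, "Clicked on Ad")

# Second goal: predict time on site; Clicked on Ad becomes an input.
names_b, X_b, y_b = split_features_target(users, "Daily Time Spent on Site")

print(names_a, y_a)
print(names_b, y_b)
```

The data is identical in both cases; only the choice of which column plays the role of y changes.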
Continuous vs Discrete Variables
Variables also have other qualities, such as being continuous or discrete. Continuous variables are expressed as integers or floating-point numbers and are compatible with mathematical operations such as addition, subtraction and division. Examples here include Daily Time Spent on Site, Age, Area Income and Daily Internet Usage. So if we add the rows together for Daily Time Spent on Site, we can aggregate this information to find the mean value or the range. We can’t do the same, though, for a variable like City, because this is a discrete variable.
A discrete variable is categorical and finite in value. This means it cannot be aggregated or mathematically manipulated with other variable observations. Examples of discrete variables here include Ad Topic Line, City, Gender and Country. Even categorical variables described in numbers, such as zip codes and customer ID numbers, meet the criteria of discrete because they cannot be aggregated like natural numbers. In the case of our dataset, Male and Clicked on Ad are expressed using integers, but these numbers can’t be aggregated like natural numbers because they are used as categorical identifiers.
(Now, Age, on the other hand, is a little tricky. On one hand, it’s easy to aggregate age and find the mean age or work out the range, which makes this variable continuous in nature. On the other hand, it could be interpreted as a discrete variable because each age has its own discrete preferences. So, for example, if you add a 30-year-old and a 20-year-old, you won’t be able to find the preferences of a 50-year-old. So Age is usually a discrete variable, but in other cases it can also be used as a continuous variable. It really depends on what you are attempting to predict and the values of the other input variables.)
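The difference in how the two kinds of variable are summarized can be sketched with Python's standard library. All values below, including the city names and zip codes, are invented for illustration.

```python
from collections import Counter
from statistics import mean

# Continuous: arithmetic on the values is meaningful.
time_on_site = [60.0, 70.0, 80.0, 90.0]
average_time = mean(time_on_site)
time_range = max(time_on_site) - min(time_on_site)

# Discrete: values are labels, so we count occurrences instead of averaging.
cities = ["Tunis", "Lima", "Tunis", "Rome"]
city_counts = Counter(cities)

# Numeric-looking labels are still categories; storing them as strings
# guards against accidentally averaging them.
zip_codes = ["90210", "10001", "90210", "60601"]
zip_counts = Counter(zip_codes)

print(average_time, time_range)
print(city_counts.most_common(1))
print(zip_counts["90210"])
```

Counting how often each category appears is one common way to summarize a discrete variable; it is shown here as an illustration rather than as the only option.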
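The two readings of Age can be sketched side by side. The (age, clicked) pairs below are invented for illustration, and grouping by exact age is just one assumed way of treating Age as discrete.

```python
from collections import defaultdict
from statistics import mean

# Invented (age, clicked_on_ad) observations.
observations = [(35, 0), (35, 1), (20, 1), (30, 0), (50, 1)]

# Continuous reading: ages can be averaged and ranged.
ages = [age for age, _ in observations]
mean_age = mean(ages)
age_range = max(ages) - min(ages)

# Discrete reading: each age is its own category with its own outcomes.
clicks_by_age = defaultdict(list)
for age, clicked in observations:
    clicks_by_age[age].append(clicked)

# 20 + 30 equals 50 as a number, but the records of the 50-year-old
# are separate from those of the 20- and 30-year-olds.
print(mean_age, age_range)
print(dict(clicks_by_age))
```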