In the above code, we have used the fit() method to fit our Simple Linear Regression object to the training set. To the fit() function we have passed x_train and y_train, which form our training dataset for the independent and dependent variables. We have fitted our regressor object to the training set so that the model can learn the correlations between the predictor and target variables. After executing the above lines of code, we will get the below output. Note that least-squares fitting is sensitive to outliers, which can lead to a model that attempts to fit the outliers more than the rest of the data. Linear Regression is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables.
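A minimal sketch of this fitting step, using scikit-learn's LinearRegression. The names regressor, x_train, and y_train mirror the tutorial; the salary figures are illustrative only, not the tutorial's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data: years of experience vs. salary (made up).
x_train = np.array([[1.1], [2.0], [3.2], [4.5], [5.1]])
y_train = np.array([39000, 43500, 54400, 61100, 66000])

regressor = LinearRegression()
regressor.fit(x_train, y_train)  # learn the x-y correlation from the training set
```

After fitting, `regressor.coef_` and `regressor.intercept_` hold the learned slope and intercept.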


Here we are taking green for the observations, but it can be any color of your choice. In summary, this linear regression analysis suggests that there is a significant relationship between years of experience (Years_Exp) and salary (Salary). The model explains approximately 72.66% of the variance in salaries, and both the intercept and the coefficient for “Years_Exp” are statistically significant at the 0.01 and 0.05 significance levels, respectively. To understand the concept, let’s consider a salary dataset in which the value of the dependent variable (salary) is given for every value of the independent variable (years of experience). For each of these deterministic relationships, the equation exactly describes the relationship between the two variables.
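The 72.66% figure above is the model's R² (coefficient of determination). With scikit-learn it can be obtained via the score() method; a sketch with made-up salary numbers (not the tutorial's data, so the R² value here is only illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up years-of-experience / salary pairs for illustration only.
years_exp = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
salary = np.array([41000, 44000, 58000, 63000, 76000])

model = LinearRegression().fit(years_exp, salary)
r_squared = model.score(years_exp, salary)  # proportion of variance explained
```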

- Although the OLS article argues that it would be more appropriate to run a quadratic regression for this data, the simple linear regression model is applied here instead.
- We will create prediction vectors y_pred and x_pred, which will contain the predictions for the test dataset and the training set, respectively.
- By executing the above lines of code, we will get the below graph plot as an output.
- Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).

## Data Processing

In this step, we will provide the test dataset (new observations) to the model to check whether it can predict the correct output or not. In the above plot, we can see the real observations as green dots, while the predicted values are covered by the red regression line. The regression line shows the correlation between the dependent and independent variables. The goodness of fit of the line can be assessed by calculating the differences between the actual and predicted values.
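A sketch of this check with made-up train/test data: the differences between actual and predicted values (the residuals) summarize how closely the line fits, and their mean absolute value gives a single error figure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up train/test split for illustration.
x_train = np.array([[1], [2], [3], [4]])
y_train = np.array([40, 50, 58, 72])
x_test = np.array([[5], [6]])
y_test = np.array([80, 91])

model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)        # predictions for new observations
residuals = y_test - y_pred           # actual minus predicted
mae = np.mean(np.abs(residuals))      # mean absolute error
```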

## Interpretation about the correlation

One variable, denoted x, is regarded as the independent variable, and the other, denoted y, is regarded as the dependent variable. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x). This data set gives average masses for women as a function of their height in a sample of American women aged 30–39. Although the OLS article argues that it would be more appropriate to run a quadratic regression for this data, the simple linear regression model is applied here instead. So now our model is ready to predict the output for new observations.
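A sketch of fitting such a height–mass relationship. The (height, mass) pairs below are hypothetical values in the spirit of that dataset, not the actual numbers from the OLS article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical height (m) / mass (kg) pairs -- not the article's actual data.
height = np.array([[1.50], [1.55], [1.60], [1.65], [1.70], [1.75], [1.80]])
mass = np.array([53.0, 56.1, 58.8, 61.5, 64.9, 68.2, 72.3])

model = LinearRegression().fit(height, mass)
predicted = model.predict(np.array([[1.68]]))  # estimated mass at 1.68 m
```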

## Assumptions of Linear Regression

Line fitting is the process of constructing a straight line that best fits a series of data points. Here, we will use the title() function of the pyplot library and pass the title “Salary vs Experience (Training Dataset)”.
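A minimal sketch of setting the title (and axis labels) with pyplot; the scatter data is made up, and the Agg backend is selected only so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

plt.scatter([1, 2, 3], [39000, 46000, 53000], color="green")  # made-up points
plt.title("Salary vs Experience (Training Dataset)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
```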

## Predict values using predict function

Now, we have to find a line that fits the above scatter plot, through which we can predict any value of y (the response) for any value of x. The line which best fits is called the regression line. By executing the above line of code (Ctrl+Enter), we can read the dataset on our Spyder IDE screen by clicking on the variable explorer option. These are some formal checks while building a Linear Regression model, which ensure we get the best possible result from the given dataset. Before proceeding, we must clarify what types of relationships we won’t study in this course, namely, deterministic (or functional) relationships.
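The dataset-reading step can be sketched as below. In the tutorial this would be `pd.read_csv("Salary_Data.csv")`; here an in-memory CSV with assumed column names (YearsExperience, Salary) stands in so the sketch is self-contained:

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("Salary_Data.csv"); column names are assumed.
csv_text = "YearsExperience,Salary\n1.1,39343\n2.0,43525\n3.2,54445\n"
data_set = pd.read_csv(io.StringIO(csv_text))
print(data_set.head())  # inspect the first rows, as the variable explorer would
```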

## Interpretation about the slope

Instead, we are interested in statistical relationships, in which the relationship between the variables is not perfect. To do so, we will use the scatter() function of the pyplot library, which we have already imported in the pre-processing step. On executing the above lines of code, two variables named y_pred and x_pred will be generated in the variable explorer options, containing the salary predictions for the test set and the training set, respectively. Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear, a sloped straight line, hence it is called Simple Linear Regression. Where Y is the object containing the dependent variable to be predicted and model is the formula for the chosen mathematical model. The command lm() provides the model’s coefficients but no further statistical information.

It is assumed that a straight line can be used to approximate the relationship. The goal of linear regression is to identify the line that minimizes the discrepancies between the observed data points and the values the line predicts. It is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.

Where y is the predicted response value, a is the y-intercept, x is the feature value, and b is the slope. To create the model, let’s estimate the values of the regression coefficients a and b. As soon as these coefficients are estimated, the response can be predicted from the model. These quantities are used to calculate the estimates of the regression coefficients and their standard errors. You might anticipate that if you lived in the higher latitudes of the northern U.S., the less exposed you’d be to the harmful rays of the sun, and therefore, the less risk you’d have of death due to skin cancer. There appears to be a negative linear relationship between latitude and mortality due to skin cancer, but the relationship is not perfect.
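The least-squares estimates of a and b can be computed in closed form: b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. A sketch with made-up data:

```python
import numpy as np

# Made-up data lying close to the line y = 0.06 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
a = y_mean - b * x_mean                                              # intercept
```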

But as we can see in the above plot, most of the observations are close to the regression line, hence our model is good for the training set. Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot library. In this function, we will pass the years of experience for the training set, the predicted salary for the training set x_pred, and the color of the line. On the x-axis, we will plot the years of experience of employees, and on the y-axis, the salary of employees. In the scatter() function, we will pass the real values of the training set, i.e., years of experience x_train, training-set salaries y_train, and the color of the observations.
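The scatter-plus-line plot described above can be sketched as follows, with made-up training data (the Agg backend is used only so the sketch runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data.
x_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([40000, 48000, 61000, 69000, 80000])

regressor = LinearRegression().fit(x_train, y_train)
x_pred = regressor.predict(x_train)  # predicted salaries for the training set

plt.scatter(x_train, y_train, color="green")  # real observations
plt.plot(x_train, x_pred, color="red")        # regression line
plt.title("Salary vs Experience (Training Dataset)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
```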

Here we are also changing the color of the observations and regression line to differentiate between the two plots, but this is optional. By executing the above lines of code, we will get the below graph plot as an output. You can check the variable by clicking on the variable explorer option in the IDE, and also compare the result by comparing the values of y_pred and y_test. By comparing these values, we can check how well our model is performing. Where Cov and Var refer to the covariance and variance of the sample data (uncorrected for bias). The last form above demonstrates how moving the line away from the center of mass of the data points affects the slope.
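The Cov/Var form of the slope mentioned above can be checked numerically; a sketch with made-up data on the exact line y = 1 + 2x (bias=True and np.var's default both use the uncorrected, population form):

```python
import numpy as np

# Made-up points lying exactly on y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

b = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = Cov(x, y) / Var(x)
a = y.mean() - b * x.mean()                    # intercept from the means
```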

Indeed, the plot exhibits some “trend,” but it also exhibits some “scatter.” Therefore, it is a statistical relationship, not a deterministic one. In the previous step, we visualized the performance of our model on the training set. The complete code remains the same as above, except that we use x_test and y_test instead of x_train and y_train. We will create prediction vectors y_pred and x_pred, which will contain the predictions for the test dataset and the training set, respectively.

In the above plot, the observations are given in blue, and the prediction is given by the red regression line. As we can see, most of the observations are close to the regression line, hence we can say our Simple Linear Regression is a good model and is able to make good predictions. In the above output image, we can see that the X (independent) and Y (dependent) variables have been extracted from the given dataset. In the above lines of code, for the x variable, we have taken the value -1, since we want to remove the last column from the dataset. For the y variable, we have taken the value 1 as a parameter, since we want to extract the second column, and indexing starts from zero. In this section, we will create a Simple Linear Regression model to find the best-fitting line representing the relationship between these two variables.
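The extraction step described above can be sketched with pandas iloc indexing; an in-memory CSV with assumed column names stands in for the tutorial's file:

```python
import io
import pandas as pd

# Stand-in for the tutorial's CSV file; column names are assumed.
csv_text = "YearsExperience,Salary\n1.1,39343\n2.0,43525\n"
data_set = pd.read_csv(io.StringIO(csv_text))

x = data_set.iloc[:, :-1].values  # all columns except the last -> independent
y = data_set.iloc[:, 1].values    # column index 1 (second column) -> dependent
```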