{"id":1996935,"date":"2023-03-05T07:03:05","date_gmt":"2023-03-05T12:03:05","guid":{"rendered":"https:\/\/wordpress-1016567-4521551.cloudwaysapps.com\/plato-data\/campus-recruitment-a-classification-problem-with-logistic-regression\/"},"modified":"2023-03-05T07:03:05","modified_gmt":"2023-03-05T12:03:05","slug":"campus-recruitment-a-classification-problem-with-logistic-regression","status":"publish","type":"station","link":"https:\/\/platodata.io\/plato-data\/campus-recruitment-a-classification-problem-with-logistic-regression\/","title":{"rendered":"Campus Recruitment: A Classification Problem with Logistic Regression"},"content":{"rendered":"

## Introduction

In this project, we will focus on data from India. Our goal is to create a predictive model, such as logistic regression, so that when we give it the characteristics of a candidate, it can predict whether that candidate will be recruited.

The dataset revolves around the placement season of a business school in India. It contains various factors about the candidates, such as work experience and exam percentages, and finally it contains the recruitment status and remuneration details.

\"Data<\/p>\n

Campus recruitment is a strategy for sourcing, engaging, and hiring young talent for internship and entry-level positions. It often involves working with university career services centers and attending career fairs to meet in person with college students and recent graduates.

This article was published as a part of the Data Science Blogathon.

## Table of Contents

1. Steps Involved in Solving the Problem
2. Prepare Data
3. Build a Logistic Regression Model
4. Results of the Logistic Regression Model
5. Conclusion

## Steps Involved in Solving the Problem

In this article, we will import the dataset, clean it, and then prepare it to build a logistic regression model. Our goals here are the following:

First, we're going to prepare our dataset for binary classification. Now, what do I mean? When we try to predict a continuous value, like the price of an apartment, it can be any number from zero up to many millions of dollars. We call that a regression problem.

But in this project, things are a little different. Instead of predicting a continuous value, we have discrete groups, or classes, that we are trying to predict between. This is called a classification problem, and because our project has only two groups to choose between, it is a binary classification.

The second goal is to create a logistic regression model to predict recruitment. Our third goal is to explain our model's predictions using the odds ratio.

Now, in terms of the machine learning workflow, these are the steps we will follow and some of the new things we will learn along the way. In the import phase, we will prepare our data to work with a binary target. In the exploration phase, we will look at the class balance: basically, what proportion of candidates was hired, and what proportion wasn't? In the feature-encoding phase, we will encode our categorical features. In the split part, we will do a randomized train-test split.

For the model-building phase, we will first set our baseline, and because we will use accuracy scores, we'll talk more about what an accuracy score is and how to build a baseline when that's the metric we're interested in. Secondly, we will do logistic regression. Then, last but not least, we will have the evaluation phase, where we again focus on the accuracy score. Finally, to communicate the results, we will look at the odds ratio.

Lastly, before diving into the work, let's introduce the libraries we will use throughout the project. First, we will import our data into a Google Colab notebook using the io library. Then, as we'll use a logistic regression model, we'll import that from scikit-learn. After that, also from scikit-learn, we will import our performance metric, the accuracy score, and the train-test split.

We will use Matplotlib and seaborn for our visualizations, and NumPy just for a little math. We need pandas to manipulate our data, LabelEncoder to encode our categorical variables, and StandardScaler to normalize the data. Those are the libraries we need.

Let's jump into preparing the data.

```python
# Import libraries
import io
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

warnings.simplefilter(action="ignore", category=FutureWarning)
```

## Prepare Data

#### Import

To start preparing the data, let's do our import work. First, we upload our data file, and then we need to read it into a DataFrame, `df`.

```python
from google.colab import files

uploaded = files.upload()
```

```python
# Read CSV file
df = pd.read_csv(io.BytesIO(uploaded["Placement_Data_Full_Class.csv"]))
print(df.shape)
df.head()
```
    \"Dataset<\/figure>\n

We can see our beautiful DataFrame: 215 records and 15 columns, including the `status` attribute, our target. Below is the description of all the features.

    \"Logistic<\/figure>\n

#### Explore

Now we have all these features to explore, so let's start our exploratory data analysis. First, let's take a look at the info for this DataFrame and see whether there are any columns we may need to keep or drop.

```python
# Inspect DataFrame
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   sl_no           215 non-null    int64
 1   gender          215 non-null    object
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object
 6   hsc_s           215 non-null    object
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object
 9   workex          215 non-null    object
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object
 14  salary          148 non-null    float64
dtypes: float64(6), int64(1), object(8)
memory usage: 25.3+ KB
```

When we look at the `df` info, there are a couple of things we're looking for. We have 215 rows in our DataFrame, and the first question to ask ourselves is: is there any missing data? Looking here, it seems we have no missing data except in the salary column, which is expected, since candidates who were not hired have no salary.

Another concern is whether there are any leaky features that would give our model information it wouldn't have if it were deployed in the real world. Remember that we want our model to predict whether a candidate will be placed or not, and we want it to make those predictions before recruitment happens. So we don't want to give it any information about candidates that only becomes available after recruitment.

It's pretty clear that the `salary` feature gives information about the salary offered by the company, and because that salary exists only for accepted candidates, this feature constitutes leakage, and we have to drop it.

    df.drop(columns=\"salary\", inplace=True)<\/code><\/code><\/pre>\n

The second thing I want to look at is the data types of these different features. Looking at them, we have eight categorical features (including our target) and seven numerical features, and everything looks correct. Now that we have these ideas, let's take some time to explore the features more deeply.

We know that our target has two classes: placed candidates and not-placed candidates. The question is, what is the relative proportion of those two classes? Are they about the same, or is one a lot more frequent than the other? That's something you need to look at when you're doing classification problems, so this is a significant step in our EDA.

```python
# Plot class balance
df["status"].value_counts(normalize=True).plot(
    kind="bar",
    xlabel="Class",
    ylabel="Relative Frequency",
    title="Class Balance"
);
```
    \"Class<\/figure>\n

Our positive class, 'Placed', accounts for more than 65% of our observations, and our negative class, 'Not Placed', is around 30%. Now, if these were severely imbalanced, say an 80/20 split or more extreme, I would call them imbalanced classes, and we'd have to do some work to make sure our model would still function in the right way. But this is an acceptable balance.
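As an aside: if the classes had been severely imbalanced, one common remedy (besides resampling) is to re-weight the classes when fitting the model. A minimal sketch, assuming the scikit-learn logistic regression we build later; we don't actually need this here, since our balance is acceptable:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical remedy, only needed if the class balance were severe (e.g., 80/20 or worse):
# class_weight="balanced" re-weights observations inversely to their class frequency.
balanced_model = LogisticRegression(class_weight="balanced", max_iter=1000)
```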

Let's make another visualization to examine the connection between our features and the target, starting with the numerical features.

First, we will see the individual distribution of each feature using a distribution plot, and we will also see the relationship between the numerical features and our target using a box plot.

```python
fig, ax = plt.subplots(5, 2, figsize=(15, 35))
for index, i in enumerate(df.select_dtypes("number").drop(columns="sl_no")):
    plt.suptitle("Visualizing Distribution of Numerical Columns Individually and by Class", size=20)
    sns.histplot(data=df, x=i, kde=True, ax=ax[index, 0])
    sns.boxplot(data=df, x="status", y=i, ax=ax[index, 1]);
```
    \""Logistic<\/figure>\n

In the first column of our plot, we can see that all the distributions are approximately normal, and most of the candidates' educational scores fall between 60 and 80%.

In the second column, we have a double box plot, with the 'Placed' class on the right and the 'Not Placed' class on the left. For the 'etest_p' and 'mba_p' features, there isn't much difference between the two distributions from a model-building perspective: there is significant overlap across the classes, so these features would not be good predictors of our target. The rest of the features are distinct enough to treat as potentially good predictors. Let's move on to the categorical features, which we will explore with a count plot.

```python
fig, ax = plt.subplots(7, 2, figsize=(15, 35))
for index, i in enumerate(df.select_dtypes("object").drop(columns="status")):
    plt.suptitle("Visualizing Count of Categorical Columns", size=20)
    sns.countplot(data=df, x=i, ax=ax[index, 0])
    sns.countplot(data=df, x=i, ax=ax[index, 1], hue="status")
```
    \""Logistic<\/figure>\n

Looking at the plot, we see that we have more male candidates than female candidates. Most of our candidates don't have any work experience, yet, in absolute numbers, more of them were hired than candidates who had experience. Commerce is the most common 'hsc' course and the most common undergraduate background, with candidates from a science background the second-largest group in both cases.

A little note on logistic regression models: although they are used for classification, they belong to the same family as other linear models, like linear regression. For that reason, we also need to worry about multicollinearity, so we need to create a correlation matrix and plot it as a heatmap. We don't want to look at all the features here, just the numerical ones, and we don't want to include our target: if our target correlates with some of our features, that is actually a good thing.

```python
corr = df.select_dtypes("number").corr()

# Plot heatmap of correlations
plt.title("Correlation Matrix")
sns.heatmap(corr, vmax=1, square=True, annot=True, cmap="GnBu");
```
    \"correlation<\/figure>\n

Light blue means little to no correlation, and dark blue means higher correlation, so we want to be on the lookout for the dark blue cells. We can see a dark blue diagonal line going down the middle of this plot: those are the features correlated with themselves. We also see some other dark squares, which means we have a bunch of correlations between features.
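If you'd rather read those dark squares off programmatically than by eye, a minimal sketch like the following lists the most correlated feature pairs (the 0.8 threshold is an arbitrary choice of ours, not part of the original analysis):

```python
import numpy as np

# Keep only the upper triangle so each pair appears once (diagonal excluded)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8])  # 0.8 is an arbitrary threshold for "high"
```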

In the final step of our EDA, we need to check for high and low cardinality in the categorical features. Cardinality refers to the number of unique values in a categorical variable; high cardinality means a categorical feature has a large number of unique values. There is no exact number of unique values that makes a feature high-cardinality, but if the value of a categorical feature is unique for almost all observations, it can usually be dropped.

```python
# Check for high- and low-cardinality categorical features
df.select_dtypes("object").nunique()
```

```
gender            2
ssc_b             2
hsc_b             2
hsc_s             3
degree_t          3
workex            2
specialisation    2
status            2
dtype: int64
```

I don't see any columns where the number of unique values is one, or anything super high. But I think there's one categorical-type column we're missing here, because it's encoded not as an object but as an integer: the 'sl_no' column. It isn't an integer in the usual sense; it's just a serial number, a unique tag for each candidate, much like a name, and a name is like a category, right? So this is a categorical variable, and since it doesn't carry any information, we need to drop it.

    df.drop(columns=\"sl_no\", inplace=True)<\/code><\/code><\/pre>\n

#### Features Encoding

We've finished our analysis, and the next thing we need to do is encode our categorical features. I will use `LabelEncoder`. Label encoding is a popular technique for handling categorical variables: each label is assigned a unique integer based on alphabetical ordering.

```python
lb = LabelEncoder()

cat_data = ["gender", "ssc_b", "hsc_b", "hsc_s", "degree_t", "workex", "specialisation", "status"]
for i in cat_data:
    df[i] = lb.fit_transform(df[i])

df.head()
```
    \"code<\/figure>\n

#### Split

We've imported and cleaned our data and done a bit of exploratory data analysis, and now we need to split it. There are two kinds of split: a vertical split (features and target) and a horizontal split (train and test sets). Let's start with the vertical one. We will create our feature matrix `X` and target vector `y`. Our target is "status"; our features are all the columns that remain in `df`.

```python
# Vertical split
target = "status"
X = df.drop(columns=target)
y = df[target]
```

Models generally perform better when they train on normalized data. So what is normalization? Normalization is transforming the values of several variables into a similar range. Here, we will standardize our features with `StandardScaler`, which rescales each one to have a mean of 0 and a standard deviation of 1.

```python
scaler = StandardScaler()
X = scaler.fit_transform(X)
```

Now let's do the horizontal split into train and test sets. We need to divide our data (X and y) into training and test sets using a randomized train-test split; our test set should be 20% of our total data. And don't forget to set a random_state for reproducibility.

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
```

```
X_train shape: (172, 12)
y_train shape: (172,)
X_test shape: (43, 12)
y_test shape: (43,)
```
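One note on ordering: above, we scaled the full feature matrix before splitting, which keeps the code simple. In a stricter workflow you would fit the scaler on the training set only and reuse it to transform the test set, so that no information from the test data leaks into preprocessing. A minimal sketch of that variant (we continue with the simpler version for the rest of this article):

```python
# Stricter variant: split first, then fit the scaler on training data only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    df.drop(columns=target), df[target], test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # learn mean/std from training data only
X_test = scaler.transform(X_test_raw)        # apply the same transform to test data
```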

## Build a Logistic Regression Model

#### Baseline

So now we need to begin building our model, and we'll start by setting our baseline. Remember that the type of problem we're dealing with is a classification problem, and there are different metrics for evaluating classification models. The one I want to focus on is the accuracy score.

Now, what is the accuracy score? In machine learning, the accuracy score is an evaluation metric that measures the number of correct predictions a model makes relative to the total number of predictions. We calculate it by dividing the number of correct predictions by the total number of predictions, so the accuracy score ranges between 0 and 1: zero is not good, that's where you don't want to be, and one is perfect. Let's keep that in mind, and remember that a baseline is a model that gives one prediction over and over again, regardless of the observation: only one guess for us.
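To make that concrete, here is a quick sketch with made-up labels showing that the accuracy score is simply the mean of the element-wise agreement between true and predicted labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])  # made-up labels for illustration
y_pred = np.array([1, 0, 0, 1, 0])  # one of the five predictions is wrong

print((y_true == y_pred).mean())       # 0.8, computed by hand
print(accuracy_score(y_true, y_pred))  # 0.8, same result via scikit-learn
```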

In our case, we have two classes: placed or not. So if we could make only one prediction, what would our one guess be? If you said the majority class, that makes sense, right? If we can only make one prediction, we should probably choose the class with the most observations in our dataset. So our baseline will use the percentage at which the majority class shows up in the training data; if the model doesn't beat this baseline, the features are not adding valuable information for classifying our observations.

We can use the `value_counts` method with the `normalize=True` argument to calculate the baseline accuracy:

```python
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
```

```
Baseline Accuracy: 0.68
```

We can see that our baseline accuracy is 68%, or 0.68 as a proportion. So for our model to be of use, we want to get above that number and closer to one. That's our goal; now let's start building our model.

#### Iterate

Now it's time to build our model using logistic regression. But before we do, let's talk a little about what logistic regression is and how it works, and then we can do the coding. For that, here we have a little grid.

Along the x-axis, let's say I have the p_degrees of the candidates in our dataset. As I move from left to right, the degrees get higher and higher. Then, along the y-axis, I have the possible classes for placement: zero and one.

    \"graph<\/figure>\n

So if we were to plot our data points, what would it look like? Our analysis shows that a candidate with a high `p_degree` is more likely to be hired, so it would probably look something like this: candidates with a small `p_degree` sit down at zero, and candidates with a high `p_degree` sit up at one.

    \"p-degree<\/figure>\n

Now let's say we wanted to do linear regression with this, and we wanted to plot a line. If we did that, the line would be plotted in such a way that it tries to be as close to all the points as possible, and we would probably end up with a line that looks something like this. Would this be a good model?

    \"iterate\"<\/figure>\n

Not really. What would happen is that, regardless of the candidate's p_degree, we would always get some intermediate value, and that won't help us, because the numbers in this context don't mean anything. This classification problem needs predictions of either zero or one, so it's not going to work that way.

On the other hand, because this is a line, what if we have a candidate with a very low p_degree? All of a sudden, our estimate is a negative number, and again, this doesn't make any sense: there are no negative classes; the value needs to be either zero or one. In the same way, a candidate with a very high p_degree might produce a value above one, which doesn't make any sense either. We need either a zero or a one.

    \"prediction\"<\/figure>\n

So what we see here are some serious limitations to using linear regression for classification. What do we need to do? We need to create a model that, number one, doesn't go below zero or above one, so it's bound between zero and one; and number two, whatever comes out of that function, that equation we create, we maybe shouldn't treat as the prediction per se, but as a step toward making our final prediction.

Now, let me unpack what I just said, and let's remind ourselves that when we do linear regression, we end up with a linear equation in its simplest form. This is the equation, or function, that gives us the straight line.

    \"img\"<\/figure>\n

There's a way to bind that line between 0 and 1: we can take the function we've just created and enclose it in another function, called a sigmoid function.

    \"function<\/figure>\n

So, I'm going to take the linear equation we just had, shrink it down, and put it inside the sigmoid function as the exponent.

    \"function<\/figure>\n

What happens is that instead of getting a straight line, we get a line that looks something like this: it's stuck at one, comes in and squiggles down, and then is stuck at zero.

    \"logistic<\/figure>\n

Right, that's what the line looks like, and we can see that we've solved our first problem: whatever we get out of this function will be between 0 and 1. In the second step, we will not treat whatever comes out of this equation as the ultimate prediction; instead, we will treat it as a probability.

    \"logistic<\/figure>\n

What do I mean? When I make a prediction, I will get some floating-point value between 0 and 1, and I will treat it as the probability that my prediction belongs to the positive class.

So if I get a value up at 0.99, I will say the probability that this candidate belongs to our positive, placed class is 99%: I'm almost sure it belongs to the positive class. Conversely, if it's down at 0.001 or so, I will say this number is low; the probability that this particular observation belongs to the positive, placed class is almost zero, so I'm going to say it belongs to class zero.

That makes sense for numbers close to one or close to zero, but you might ask yourself: what do I do with the values in between? The way that works is that we put a cutoff line right at 0.5. Any value below that line I'll put at zero, so my prediction is no; and if it's above that line, above 0.5, I'll put it in the positive class, so my prediction is one.
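To make those two steps concrete, here is a minimal sketch of the sigmoid and the 0.5 cutoff; the intercept and slope are made up for illustration, not taken from our model:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1 / (1 + np.exp(-z))

# Made-up intercept and slope, purely for illustration
beta0, beta1 = -4.0, 0.08
p_degree = np.array([30.0, 50.0, 70.0, 90.0])

proba = sigmoid(beta0 + beta1 * p_degree)  # probabilities between 0 and 1
pred = (proba >= 0.5).astype(int)          # the 0.5 cutoff turns them into 0/1
print(proba, pred)
```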

    \"p-degree\"<\/figure>\n

So now I have a function that gives me a prediction between zero and one, which I treat as a probability. If that probability is above 0.5, or 50%, I say: okay, positive class, one. If it's below 50%, I say: negative class, zero. That is how logistic regression works, and now that we understand it, let's code it up and fit it. I will set the hyperparameter `max_iter` to 1000; this parameter sets the maximum number of iterations for the solver to converge.

```python
# Build model
model = LogisticRegression(max_iter=1000)

# Fit model to training data
model.fit(X_train, y_train)
```

```
LogisticRegression(max_iter=1000)
```

#### Evaluate

Now it's time to see how our model does: it's time to evaluate the logistic regression model. Remember that this time the performance metric we're interested in is the accuracy score, and we want to beat the baseline of 0.68. Model accuracy can be calculated with the `accuracy_score` function, which requires two arguments: the true labels and the predicted labels.

```python
acc_train = accuracy_score(y_train, model.predict(X_train))
acc_test = model.score(X_test, y_test)

print("Training Accuracy:", round(acc_train, 2))
print("Test Accuracy:", round(acc_test, 2))
```

```
Training Accuracy: 0.9
Test Accuracy: 0.88
```

We can see that our training accuracy is 90%, beating the baseline. Our test accuracy is a little lower at 88%; it also beats the baseline and is very close to our training accuracy. That's good news, because it means our model isn't overfitting.

## Results of the Logistic Regression Model

Remember that with logistic regression, we end up with final predictions of zero or one, but underneath each prediction there is a probability, a floating-point number between zero and one, and sometimes it is helpful to see those probability estimates. Let's look at our training predictions, starting with the first five. The `predict` method predicts the target of an unlabeled observation.

```python
model.predict(X_train)[:5]
```

```
array([0, 1, 1, 1, 1])
```

So those are the final predictions, but what are the probabilities behind them? To get those, we need slightly different code: instead of using the `predict` method, I will use `predict_proba` with our training data.

```python
y_train_pred_proba = model.predict_proba(X_train)
print(y_train_pred_proba[:5])
```

```
[[0.92003219 0.07996781]
 [0.03202019 0.96797981]
 [0.00678421 0.99321579]
 [0.03889446 0.96110554]
 [0.00245525 0.99754475]]
```

We can see a kind of nested list with two columns. The column on the left is the probability that a candidate is not placed, our negative class 'Not Placed'; the other column is the probability of the positive class 'Placed', that a candidate is placed. We will focus on the second column. If we look at the first probability estimate, we can see it is about 0.08; since that's below 50%, our model says: my prediction is zero. For the following predictions, the values are all above 0.5, which is why our model predicted one in the end.
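If you want to verify that relationship yourself, thresholding the positive-class column of `predict_proba` at 0.5 should reproduce what `predict` returns:

```python
# The positive-class column thresholded at 0.5 matches model.predict
manual_pred = (model.predict_proba(X_train)[:, 1] >= 0.5).astype(int)
print((manual_pred == model.predict(X_train)).all())  # True
```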

Now we want to extract the feature names and importances and put them in a Series. And because we need to display the feature importances as odds ratios, we need to do a little mathematical transformation: taking the exponential of each importance.

```python
# Feature names
features = ["gender", "ssc_p", "ssc_b", "hsc_p", "hsc_b", "hsc_s", "degree_p",
            "degree_t", "workex", "etest_p", "specialisation", "mba_p"]

# Get importances
importances = model.coef_[0]

# Put importances into a Series as odds ratios
odds_ratios = pd.Series(np.exp(importances), index=features).sort_values()

# Review
odds_ratios.head()
```

```
mba_p             0.406590
degree_t          0.706021
specialisation    0.850301
hsc_b             0.876864
etest_p           0.877831
dtype: float64
```

Before discussing the odds ratios and what they are, let's get them onto a horizontal bar chart. We will use pandas to make the plot, and remember that we're looking for the five largest coefficients, so we don't want all the odds ratios, just the tail.

```python
# Horizontal bar chart, five largest coefficients
odds_ratios.tail().plot(kind="barh")
plt.xlabel("Odds Ratio")
plt.ylabel("Feature")
plt.title("High Importance Features");
```
    \"<\/figure>\n

Now I want you to imagine a vertical line right at 1, and start by looking at that. Let's talk about these individually, or at least the first couple. Let's start with 'ssc_p', which refers to the Secondary Education percentage (10th grade). We can see that its odds ratio is about 6. Now, what does that mean? It means that if a candidate has a high 'ssc_p', the odds of their placement are about six times greater than for other candidates, all other things being equal. Another way to think of it: when a candidate has a high `ssc_p`, the odds of the candidate's recruitment increase six-fold.

Any odds ratio over 1 increases the odds that a candidate is placed, which is why we imagine that vertical line at 1. These five features are the characteristics most associated with increased recruitment. So that's what our odds ratio tells us. Now that we've looked at the features most associated with an increase in recruitment, let's look at the features associated with a decrease in recruitment. It's time to look at the smallest ones: instead of looking at the tail, we will look at the head.

```python
odds_ratios.head().plot(kind="barh")
plt.xlabel("Odds Ratio")
plt.ylabel("Feature")
plt.title("Low Importance Features");
```
    \"low<\/figure>\n

The first thing to notice here is that on the x-axis everything is at one or below. Now, what does that mean? Let's take a look at our smallest odds ratio, mba_p, which refers to the MBA percentage. We can see that it sits at roughly 0.45. The difference between 0.45 and 1 is 0.55, and what does that number mean? Candidates with a higher MBA percentage are less likely to be recruited, all other things being equal: it decreases the odds of recruitment by a factor of roughly 0.45, that is, by about 55%. And the same reading applies to everything here.
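As a worked version of that arithmetic: an odds ratio is just the exponential of a model coefficient, and the percentage change in the odds falls out directly (the coefficient below is made up for illustration, not taken from our fitted model):

```python
import numpy as np

coef = -0.80                  # made-up logistic regression coefficient
odds_ratio = np.exp(coef)     # about 0.45
pct_change = (odds_ratio - 1) * 100
print(round(odds_ratio, 2), round(pct_change, 0))  # 0.45, -55.0 -> odds drop ~55%
```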

## Conclusion

So what did we learn? First, in the prepare-data phase, we learned that we are working with classification, specifically binary classification, using logistic regression. In terms of exploring the data, we did a ton of stuff, but as for highlights, we looked at the class balance: the proportion of our positive and negative classes. Then we split our data.

Since logistic regression is a classification model, we learned about a new performance metric, the accuracy score, which ranges between 0 and 1: zero is bad, and one is good. While iterating, we learned about logistic regression: a magical way in which you can take a linear equation, a straight line, and put it inside another function, a sigmoid (an activation function), get a probability estimate out of it, and turn that probability estimate into a prediction.

Finally, we learned about the odds ratio and how we can interpret the coefficients to see whether a given feature increases or decreases the odds that a candidate is recruited.

Project source code: https://github.com/SawsanYusuf/Campus-Recruitment.git

The media shown in this article is not owned by Analytics Vidhya and is used at the author's discretion.