Logistic Regression with Non-Integer feature value











up vote
1
down vote

favorite












Hi I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, specially logistic regression they have used integer values for the features which could be plotted in a graph. But there are so many use cases where the feature values may not be integer.



Let's consider the follow example :



I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.



Name of the person, Day of the week, Number of Leaves left for him till now (which maybe a continuous decreasing variable), etc.



So here are the following questions based on above




  1. How do I go about designing the training set for my logistic regression model.


  2. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem, because I know continuously increasing or decreasing variables are used in linear regression. Is that true ?



Any help is really appreciated. Thanks !










share|improve this question






















  • I think this is more of a question for data science stack exchange datascience.stackexchange.com since you require no assistance with code.
    – jman
    Nov 9 at 9:50










  • Ok will post it there
    – Arka Roy
    Nov 9 at 10:08















up vote
1
down vote

favorite












Hi I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, specially logistic regression they have used integer values for the features which could be plotted in a graph. But there are so many use cases where the feature values may not be integer.



Let's consider the follow example :



I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.



Name of the person, Day of the week, Number of Leaves left for him till now (which maybe a continuous decreasing variable), etc.



So here are the following questions based on above




  1. How do I go about designing the training set for my logistic regression model.


  2. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem, because I know continuously increasing or decreasing variables are used in linear regression. Is that true ?



Any help is really appreciated. Thanks !










share|improve this question






















  • I think this is more of a question for data science stack exchange datascience.stackexchange.com since you require no assistance with code.
    – jman
    Nov 9 at 9:50










  • Ok will post it there
    – Arka Roy
    Nov 9 at 10:08













up vote
1
down vote

favorite









up vote
1
down vote

favorite











Hi I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, specially logistic regression they have used integer values for the features which could be plotted in a graph. But there are so many use cases where the feature values may not be integer.



Let's consider the follow example :



I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.



Name of the person, Day of the week, Number of Leaves left for him till now (which maybe a continuous decreasing variable), etc.



So here are the following questions based on above




  1. How do I go about designing the training set for my logistic regression model.


  2. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem, because I know continuously increasing or decreasing variables are used in linear regression. Is that true ?



Any help is really appreciated. Thanks !










share|improve this question













Hi I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, specially logistic regression they have used integer values for the features which could be plotted in a graph. But there are so many use cases where the feature values may not be integer.



Let's consider the follow example :



I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.



Name of the person, Day of the week, Number of Leaves left for him till now (which maybe a continuous decreasing variable), etc.



So here are the following questions based on above




  1. How do I go about designing the training set for my logistic regression model.


  2. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem, because I know continuously increasing or decreasing variables are used in linear regression. Is that true ?



Any help is really appreciated. Thanks !







machine-learning regression linear-regression logistic-regression prediction






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 9 at 9:42









Arka Roy

83




83












  • I think this is more of a question for data science stack exchange datascience.stackexchange.com since you require no assistance with code.
    – jman
    Nov 9 at 9:50










  • Ok will post it there
    – Arka Roy
    Nov 9 at 10:08


















  • I think this is more of a question for data science stack exchange datascience.stackexchange.com since you require no assistance with code.
    – jman
    Nov 9 at 9:50










  • Ok will post it there
    – Arka Roy
    Nov 9 at 10:08
















I think this is more of a question for data science stack exchange datascience.stackexchange.com since you require no assistance with code.
– jman
Nov 9 at 9:50




I think this is more of a question for data science stack exchange datascience.stackexchange.com since you require no assistance with code.
– jman
Nov 9 at 9:50












Ok will post it there
– Arka Roy
Nov 9 at 10:08




Ok will post it there
– Arka Roy
Nov 9 at 10:08












1 Answer
1






active

oldest

votes

















up vote
1
down vote



accepted










Well, there are a lot of missing information in your question, for example, it'll be very much clearer if you have provided all the features you have, but let me dare to throw some assumptions!





ML Modeling in classification always requires dealing with numerical inputs, and you can easily infer each of the unique input as an integer, especially the classes!





Now let me try to answer your questions:





  1. How do I go about designing the training set for my logistic regression model.




How I see it, you have two options (not necessary both are practical, it's you who should decide according to the dataset you have and the problem), either you predict the probability of all employees in the company who will be off in a certain day according to the historical data you have (i.e. previous observations), in this case, each employee will represent a class (integer from 0 to the number of employees you want to include). Or you create a model for each employee, in this case the classes will be either off (i.e. Leave) or on (i.e. Present).



Example 1



I created a dataset example of 70 cases and 4 employees which looks like this:



dataset 1



Here each name is associated with the day and month they took as off with respect to how many Annual Leaves left for them!



The implementation (using Scikit-Learn) would be something like this (N.B date contains only day and month):



Now we can do something like this:



import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)

y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)

# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})


Result



{'Ruby': 0.27545, 'Oliver': 0.15032, 
'Emily': 0.28201, 'Jack': 0.29219}


N.B
To make this relatively work you need a real big dataset!



Also this can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee at that day..etc).





The second option is to create a model for each employee, here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!



For each employee, you collect all their leaves in the past years and concatenate them into one file, in this case you have to complete all days in the year, in other words: for every day that employee has never got it off, that day should be labeled as on (or numerically speaking 1) and for the days off they should be labeled as off (or numerically speaking 0).



Obviously, in this case, the classes will be 0 and 1 (i.e. on and off) for each employee's model!



For example, consider this dataset example for the particular employee Jack:



Example 2



dataset 2



Then you can do for example:



import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)

y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)


# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})


Result



{'On': 0.33348, 'Off': 0.66651}


N.B in this case you have to create a dataset for each employee + training especial model + filling all the days the never taken in the past years as off!







  1. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
    because I know continuously increasing or decreasing variables are
    used in linear regression. Is that true ?




Well, there is nothing preventing you from using contentious values as features (e.g. number of leaves) in Logistic Regression; actually it doesn't make any difference if it's used in Linear or Logistic Regression but I believe you got confused between the features and the response:



The thing is, discrete values should be used in the response of Logistic Regression and Continuous values should be used in the response of the Linear Regression (a.k.a dependent variable or y).






share|improve this answer























  • I got your answer. I might have chosen name as a wrong feature for this use case. Let me rephrase my question this way - Suppose a model wants to predict the gender of a person. The features choose would be name and country (sometimes a same name has different gender in different country). So a training set ( where y (i)= { 0 ,1} and X (i) = {somename, somecountry}, for ith training set), how would I plot the graph so that I can visualize the decision boundary ?
    – Arka Roy
    Nov 9 at 19:57










  • Wrt to original question I wanted to predict given name X(1), leaves left X(2), Day X(3) what would be my probability that the person would take a leave. But I understood that in this case name X(1) should be the classes itself or I built a model for each names.
    – Arka Roy
    Nov 9 at 19:59










  • @Arka This should be a new different question. Accept this answer for now please.
    – Yahya
    Nov 9 at 22:07










  • Done. Thanks for the help. I read about one hot encoding as one of the way to address the above problem as well.
    – Arka Roy
    Nov 10 at 18:25











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53223252%2flogistic-regression-with-non-integer-feature-value%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










Well, there are a lot of missing information in your question, for example, it'll be very much clearer if you have provided all the features you have, but let me dare to throw some assumptions!





ML Modeling in classification always requires dealing with numerical inputs, and you can easily infer each of the unique input as an integer, especially the classes!





Now let me try to answer your questions:





  1. How do I go about designing the training set for my logistic regression model.




How I see it, you have two options (not necessary both are practical, it's you who should decide according to the dataset you have and the problem), either you predict the probability of all employees in the company who will be off in a certain day according to the historical data you have (i.e. previous observations), in this case, each employee will represent a class (integer from 0 to the number of employees you want to include). Or you create a model for each employee, in this case the classes will be either off (i.e. Leave) or on (i.e. Present).



Example 1



I created a dataset example of 70 cases and 4 employees which looks like this:



dataset 1



Here each name is associated with the day and month they took as off with respect to how many Annual Leaves left for them!



The implementation (using Scikit-Learn) would be something like this (N.B date contains only day and month):



Now we can do something like this:



import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)

y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)

# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})


Result



{'Ruby': 0.27545, 'Oliver': 0.15032, 
'Emily': 0.28201, 'Jack': 0.29219}


N.B
To make this relatively work you need a real big dataset!



Also this can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee at that day..etc).





The second option is to create a model for each employee, here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!



For each employee, you collect all their leaves in the past years and concatenate them into one file, in this case you have to complete all days in the year, in other words: for every day that employee has never got it off, that day should be labeled as on (or numerically speaking 1) and for the days off they should be labeled as off (or numerically speaking 0).



Obviously, in this case, the classes will be 0 and 1 (i.e. on and off) for each employee's model!



For example, consider this dataset example for the particular employee Jack:



Example 2



dataset 2



Then you can do for example:



import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)

y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)


# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})


Result



{'On': 0.33348, 'Off': 0.66651}


N.B in this case you have to create a dataset for each employee + training especial model + filling all the days the never taken in the past years as off!







  1. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
    because I know continuously increasing or decreasing variables are
    used in linear regression. Is that true ?




Well, there is nothing preventing you from using contentious values as features (e.g. number of leaves) in Logistic Regression; actually it doesn't make any difference if it's used in Linear or Logistic Regression but I believe you got confused between the features and the response:



The thing is, discrete values should be used in the response of Logistic Regression and Continuous values should be used in the response of the Linear Regression (a.k.a dependent variable or y).






share|improve this answer























  • I got your answer. I might have chosen name as a wrong feature for this use case. Let me rephrase my question this way - Suppose a model wants to predict the gender of a person. The features choose would be name and country (sometimes a same name has different gender in different country). So a training set ( where y (i)= { 0 ,1} and X (i) = {somename, somecountry}, for ith training set), how would I plot the graph so that I can visualize the decision boundary ?
    – Arka Roy
    Nov 9 at 19:57










  • Wrt to original question I wanted to predict given name X(1), leaves left X(2), Day X(3) what would be my probability that the person would take a leave. But I understood that in this case name X(1) should be the classes itself or I built a model for each names.
    – Arka Roy
    Nov 9 at 19:59










  • @Arka This should be a new different question. Accept this answer for now please.
    – Yahya
    Nov 9 at 22:07










  • Done. Thanks for the help. I read about one hot encoding as one of the way to address the above problem as well.
    – Arka Roy
    Nov 10 at 18:25















up vote
1
down vote



accepted










Well, there are a lot of missing information in your question, for example, it'll be very much clearer if you have provided all the features you have, but let me dare to throw some assumptions!





ML Modeling in classification always requires dealing with numerical inputs, and you can easily infer each of the unique input as an integer, especially the classes!





Now let me try to answer your questions:





  1. How do I go about designing the training set for my logistic regression model.




How I see it, you have two options (not necessary both are practical, it's you who should decide according to the dataset you have and the problem), either you predict the probability of all employees in the company who will be off in a certain day according to the historical data you have (i.e. previous observations), in this case, each employee will represent a class (integer from 0 to the number of employees you want to include). Or you create a model for each employee, in this case the classes will be either off (i.e. Leave) or on (i.e. Present).



Example 1



I created a dataset example of 70 cases and 4 employees which looks like this:



dataset 1



Here each name is associated with the day and month they took as off with respect to how many Annual Leaves left for them!



The implementation (using Scikit-Learn) would be something like this (N.B date contains only day and month):



Now we can do something like this:



import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)

y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)

# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})


Result



{'Ruby': 0.27545, 'Oliver': 0.15032, 
'Emily': 0.28201, 'Jack': 0.29219}


N.B
To make this relatively work you need a real big dataset!



Also this can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee at that day..etc).





The second option is to create a model for each employee, here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!



For each employee, you collect all their leaves in the past years and concatenate them into one file, in this case you have to complete all days in the year, in other words: for every day that employee has never got it off, that day should be labeled as on (or numerically speaking 1) and for the days off they should be labeled as off (or numerically speaking 0).



Obviously, in this case, the classes will be 0 and 1 (i.e. on and off) for each employee's model!



For example, consider this dataset example for the particular employee Jack:



Example 2



dataset 2



Then you can do for example:



import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)

y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)


# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})


Result



{'On': 0.33348, 'Off': 0.66651}


N.B in this case you have to create a dataset for each employee + training especial model + filling all the days the never taken in the past years as off!







  1. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
    because I know continuously increasing or decreasing variables are
    used in linear regression. Is that true ?




Well, there is nothing preventing you from using contentious values as features (e.g. number of leaves) in Logistic Regression; actually it doesn't make any difference if it's used in Linear or Logistic Regression but I believe you got confused between the features and the response:



The thing is, discrete values should be used in the response of Logistic Regression and Continuous values should be used in the response of the Linear Regression (a.k.a dependent variable or y).






share|improve this answer























  • I got your answer. I might have chosen name as a wrong feature for this use case. Let me rephrase my question this way - Suppose a model wants to predict the gender of a person. The features choose would be name and country (sometimes a same name has different gender in different country). So a training set ( where y (i)= { 0 ,1} and X (i) = {somename, somecountry}, for ith training set), how would I plot the graph so that I can visualize the decision boundary ?
    – Arka Roy
    Nov 9 at 19:57










  • Wrt to original question I wanted to predict given name X(1), leaves left X(2), Day X(3) what would be my probability that the person would take a leave. But I understood that in this case name X(1) should be the classes itself or I built a model for each names.
    – Arka Roy
    Nov 9 at 19:59










  • @Arka This should be a new different question. Accept this answer for now please.
    – Yahya
    Nov 9 at 22:07










  • Done. Thanks for the help. I read about one hot encoding as one of the way to address the above problem as well.
    – Arka Roy
    Nov 10 at 18:25













up vote
1
down vote



accepted







up vote
1
down vote



accepted






Well, there are a lot of missing information in your question, for example, it'll be very much clearer if you have provided all the features you have, but let me dare to throw some assumptions!





ML Modeling in classification always requires dealing with numerical inputs, and you can easily infer each of the unique input as an integer, especially the classes!





Now let me try to answer your questions:





  1. How do I go about designing the training set for my logistic regression model.




How I see it, you have two options (not necessary both are practical, it's you who should decide according to the dataset you have and the problem), either you predict the probability of all employees in the company who will be off in a certain day according to the historical data you have (i.e. previous observations), in this case, each employee will represent a class (integer from 0 to the number of employees you want to include). Or you create a model for each employee, in this case the classes will be either off (i.e. Leave) or on (i.e. Present).



Example 1



I created a dataset example of 70 cases and 4 employees which looks like this:



dataset 1



Here each name is associated with the day and month they took as off with respect to how many Annual Leaves left for them!



The implementation (using Scikit-Learn) would be something like this (N.B date contains only day and month):



Now we can do something like this:



import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)

y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)

# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})


Result



{'Ruby': 0.27545, 'Oliver': 0.15032, 
'Emily': 0.28201, 'Jack': 0.29219}


N.B
To make this relatively work you need a real big dataset!



Also this can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee at that day..etc).





The second option is to create a model for each employee, here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!



For each employee, you collect all their leaves in the past years and concatenate them into one file, in this case you have to complete all days in the year, in other words: for every day that employee has never got it off, that day should be labeled as on (or numerically speaking 1) and for the days off they should be labeled as off (or numerically speaking 0).



Obviously, in this case, the classes will be 0 and 1 (i.e. on and off) for each employee's model!



For example, consider this dataset example for the particular employee Jack:



Example 2



dataset 2



Then you can do for example:



import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)

y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)


# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})


Result



{'On': 0.33348, 'Off': 0.66651}


N.B in this case you have to create a dataset for each employee + training especial model + filling all the days the never taken in the past years as off!







  1. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
    because I know continuously increasing or decreasing variables are
    used in linear regression. Is that true ?




Well, there is nothing preventing you from using contentious values as features (e.g. number of leaves) in Logistic Regression; actually it doesn't make any difference if it's used in Linear or Logistic Regression but I believe you got confused between the features and the response:



The thing is, discrete values should be used in the response of Logistic Regression and Continuous values should be used in the response of the Linear Regression (a.k.a dependent variable or y).






share|improve this answer














Well, there are a lot of missing information in your question, for example, it'll be very much clearer if you have provided all the features you have, but let me dare to throw some assumptions!





ML Modeling in classification always requires dealing with numerical inputs, and you can easily infer each of the unique input as an integer, especially the classes!





Now let me try to answer your questions:





  1. How do I go about designing the training set for my logistic regression model.




How I see it, you have two options (not necessary both are practical, it's you who should decide according to the dataset you have and the problem), either you predict the probability of all employees in the company who will be off in a certain day according to the historical data you have (i.e. previous observations), in this case, each employee will represent a class (integer from 0 to the number of employees you want to include). Or you create a model for each employee, in this case the classes will be either off (i.e. Leave) or on (i.e. Present).



Example 1



I created a dataset example of 70 cases and 4 employees which looks like this:



dataset 1



Here each name is associated with the day and month they took as off with respect to how many Annual Leaves left for them!



The implementation (using Scikit-Learn) would be something like this (N.B date contains only day and month):



Now we can do something like this:



import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)

y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)

# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})


Result



{'Ruby': 0.27545, 'Oliver': 0.15032, 
'Emily': 0.28201, 'Jack': 0.29219}


N.B
To make this relatively work you need a real big dataset!



Also this can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee at that day..etc).





The second option is to create a model for each employee, here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!



For each employee, you collect all their leaves in the past years and concatenate them into one file, in this case you have to complete all days in the year, in other words: for every day that employee has never got it off, that day should be labeled as on (or numerically speaking 1) and for the days off they should be labeled as off (or numerically speaking 0).



Obviously, in this case, the classes will be 0 and 1 (i.e. on and off) for each employee's model!



For example, consider this dataset example for the particular employee Jack:



Example 2



dataset 2



Then you can do for example:



import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)

y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])

# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)


# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})


Result



{'On': 0.33348, 'Off': 0.66651}


N.B in this case you have to create a dataset for each employee + training especial model + filling all the days the never taken in the past years as off!







  1. In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
    because I know continuously increasing or decreasing variables are
    used in linear regression. Is that true ?




Well, there is nothing preventing you from using contentious values as features (e.g. number of leaves) in Logistic Regression; actually it doesn't make any difference if it's used in Linear or Logistic Regression but I believe you got confused between the features and the response:



The thing is, discrete values should be used in the response of Logistic Regression and Continuous values should be used in the response of the Linear Regression (a.k.a dependent variable or y).







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 11 at 10:01

























answered Nov 9 at 17:57









Yahya

3,5092828




3,5092828












  • I got your answer. I might have chosen name as a wrong feature for this use case. Let me rephrase my question this way - Suppose a model wants to predict the gender of a person. The features choose would be name and country (sometimes a same name has different gender in different country). So a training set ( where y (i)= { 0 ,1} and X (i) = {somename, somecountry}, for ith training set), how would I plot the graph so that I can visualize the decision boundary ?
    – Arka Roy
    Nov 9 at 19:57










  • Wrt to original question I wanted to predict given name X(1), leaves left X(2), Day X(3) what would be my probability that the person would take a leave. But I understood that in this case name X(1) should be the classes itself or I built a model for each names.
    – Arka Roy
    Nov 9 at 19:59










  • @Arka This should be a new different question. Accept this answer for now please.
    – Yahya
    Nov 9 at 22:07










  • Done. Thanks for the help. I read about one hot encoding as one of the way to address the above problem as well.
    – Arka Roy
    Nov 10 at 18:25


















  • I got your answer. I might have chosen name as a wrong feature for this use case. Let me rephrase my question this way - Suppose a model wants to predict the gender of a person. The features choose would be name and country (sometimes a same name has different gender in different country). So a training set ( where y (i)= { 0 ,1} and X (i) = {somename, somecountry}, for ith training set), how would I plot the graph so that I can visualize the decision boundary ?
    – Arka Roy
    Nov 9 at 19:57










  • Wrt to original question I wanted to predict given name X(1), leaves left X(2), Day X(3) what would be my probability that the person would take a leave. But I understood that in this case name X(1) should be the classes itself or I built a model for each names.
    – Arka Roy
    Nov 9 at 19:59










  • @Arka This should be a new different question. Accept this answer for now please.
    – Yahya
    Nov 9 at 22:07










  • Done. Thanks for the help. I read about one hot encoding as one of the way to address the above problem as well.
    – Arka Roy
    Nov 10 at 18:25
















I got your answer. I might have chosen name as a wrong feature for this use case. Let me rephrase my question this way - Suppose a model wants to predict the gender of a person. The features choose would be name and country (sometimes a same name has different gender in different country). So a training set ( where y (i)= { 0 ,1} and X (i) = {somename, somecountry}, for ith training set), how would I plot the graph so that I can visualize the decision boundary ?
– Arka Roy
Nov 9 at 19:57




I got your answer. I might have chosen name as a wrong feature for this use case. Let me rephrase my question this way - Suppose a model wants to predict the gender of a person. The features choose would be name and country (sometimes a same name has different gender in different country). So a training set ( where y (i)= { 0 ,1} and X (i) = {somename, somecountry}, for ith training set), how would I plot the graph so that I can visualize the decision boundary ?
– Arka Roy
Nov 9 at 19:57












Wrt to original question I wanted to predict given name X(1), leaves left X(2), Day X(3) what would be my probability that the person would take a leave. But I understood that in this case name X(1) should be the classes itself or I built a model for each names.
– Arka Roy
Nov 9 at 19:59




Wrt to original question I wanted to predict given name X(1), leaves left X(2), Day X(3) what would be my probability that the person would take a leave. But I understood that in this case name X(1) should be the classes itself or I built a model for each names.
– Arka Roy
Nov 9 at 19:59












@Arka This should be a new different question. Accept this answer for now please.
– Yahya
Nov 9 at 22:07




@Arka This should be a new different question. Accept this answer for now please.
– Yahya
Nov 9 at 22:07












Done. Thanks for the help. I read about one hot encoding as one of the way to address the above problem as well.
– Arka Roy
Nov 10 at 18:25




Done. Thanks for the help. I read about one hot encoding as one of the way to address the above problem as well.
– Arka Roy
Nov 10 at 18:25


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53223252%2flogistic-regression-with-non-integer-feature-value%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Xamarin.iOS Cant Deploy on Iphone

Glorious Revolution

Dulmage-Mendelsohn matrix decomposition in Python