Re-assign column values in a pandas df











up vote
18
down vote

favorite
7












This question is related to rostering or staffing. I'm trying to assign various jobs to individuals (employees). Using the df below,



`[Person]` = Individuals (employees)
`[Area]` and `[Place]` = unique jobs
`[On]` = How many unique jobs are occurring at each point in time


So [Area] and [Place] together will make up unique values that are different jobs. These values will be assigned to individuals with the overall aim to use the least amount of individuals possible. The most unique values assigned to any one individual is 3. [On] displays how many current unique values for [Place] and [Area] are occurring. So this provides a concrete guide on how many individuals I need. For example,



1-3 unique values occurring = 1 individual
4-6 unique values occurring = 2 individuals
7-9 unique values occurring = 3 individuals etc


Question:
Where the amount of unique values in [Area] and [Place] is greater than 3 is causing me trouble. I can't do a groupby where I assign the first 3 unique values to individual 1 and the next 3 unique values to individual 2 etc. I want to group unique values in [Area] and [Place] by [Area]. So look to assign same values in [Area] to an individual (up to 3). Then, if there are leftover values (<3), they should be combined to make a group of 3, where possible.



The way I envisage this working is: see into the future by an hour. For each new row of values the script should see how many values will be [On](this provides an indication of how many total individuals are required). Where unique values are >3, they should be assigned by grouping the same value in [Area]. If there are leftover values they should be combined anyhow to make up to a group of 3.



Putting that into a step by step process:



1) Use the [On] Column to determine how many individuals are required by looking into the future for an hour



2) Where there are more than 3 unique values occurring assign the identical values in [Area] first.



3) If there are any leftover values then look to combine anyway possible.



For the df below, there are 9 unique values occurring for [Place] and [Area] with an hour. So we should have 3 individuals assigned. When unique values >3 it should be assigned by [Area] and seeing if the same value occurs. The leftover values should be combined with other individuals that have less than 3 unique values.



import pandas as pd
import numpy as np

d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
'Area' : ['A','B','C','D','E','D','E','F','G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
})

df = pd.DataFrame(data=d)


This is my attempt:



def reduce_df(df):
values = df['Area'] + df['Place']
df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
person_count = df1.groupby('Person')['Person'].agg('count')
leftover_count = person_count[person_count < 3] # the 'leftovers'

# try merging pairs together
nleft = leftover_count.shape[0]
to_try = np.arange(nleft - 1)
to_merge = (leftover_count.values[to_try] +
leftover_count.values[to_try + 1]) <= 3
to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
to_merge = to_try[to_merge]
merge_dict = dict(zip(leftover_count.index.values[to_merge+1],
leftover_count.index.values[to_merge]))
def change_person(p):
if p in merge_dict.keys():
return merge_dict[p]
return p
reduced_df = df.copy()
# update df with the merges you found
reduced_df['Person'] = reduced_df['Person'].apply(change_person)
return reduced_df

df1 = (reduce_df(reduce_df(df)))


This is the Output:



       Time    Place Area On    Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 4
4 8:35:00 House 5 E 5 Person 5
5 8:40:00 House 1 D 6 Person 4
6 8:42:00 House 2 E 7 Person 5
7 8:45:00 House 3 F 8 Person 5
8 8:50:00 House 2 G 9 Person 7


This is my Intended Output:



       Time    Place Area On    Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 5 E 5 Person 3
5 8:40:00 House 6 D 6 Person 2
6 8:42:00 House 2 E 7 Person 3
7 8:45:00 House 3 F 8 Person 2
8 8:50:00 House 2 G 9 Person 3


Description on how I want to get this output:



Index 0: One `unique` value occurring. So `assign` to individual 1
Index 1: Two `unique` values occurring. So `assign` to individual 1
Index 2: Three `unique` values occurring. So `assign` to individual 1
Index 3: Four `unique` values on. So `assign` to individual 2
Index 4: Five `unique` values on. This one is a bit tricky and hard to conceptualise. But there is another `E` within an `hour`. So `assign` to a new individual so it can be combined with the other `E`
Index 5: Six `unique` values on. Should be `assigned` with the other `D`. So individual 2
Index 6: Seven `unique` values on. Should be `assigned` with other `E`. So individual 3
Index 7: Eight `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Index 8: Nine `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3


Example No2:



d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
'Area' : ['X','X','X','X','X','X','X','X','X'],
'On' : ['1','2','3','3','3','3','3','3','3'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
})

df = pd.DataFrame(data=d)


I am getting an error:



 IndexError: index 1 is out of bounds for axis 1 with size 1


On this line:



df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]


However, if I change the Person to 1,2,3 repeating, it returns the following:



'Person' : ['Person 1','Person 2','Person 3','Person 1','Person 2','Person 3','Person 1','Person 2','Person 3'], 

Time Place Area On Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 2
4 8:35:00 House 2 X 3 Person 2
5 8:40:00 House 3 X 3 Person 2
6 8:42:00 House 1 X 3 Person 3
7 8:45:00 House 2 X 3 Person 3
8 8:50:00 House 3 X 3 Person 3


Intended Output:



      Time    Place Area On    Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 1
4 8:35:00 House 2 X 3 Person 1
5 8:40:00 House 3 X 3 Person 1
6 8:42:00 House 1 X 3 Person 1
7 8:45:00 House 2 X 3 Person 1
8 8:50:00 House 3 X 3 Person 1


The main takeaway from Example 2 is:



1) There are <3 unique values on so assign to individual 1









share|improve this question




















  • 4




    I am confused about the desired output, why is Person 4 not part of your desired output? I think the constraints of the problem aren't super clear
    – djakubosky
    Oct 30 at 1:45








  • 2




    @PeterJames123 Your output looks good to me. You have 3 individuals in total as you needed.(Person1, Person2 and Person4). Why you can't use this output? If order of the person is important you can check the order and see Person3 is missing then you can replace Person4 by Person3 and s on....
    – jimmy
    Oct 31 at 6:10






  • 8




    In your question you say that the values' uniqueness depend on Hour and Area. At the same time you say that your input d has 9 unique values "occuring within the hour" even though it is 9 values (observations) split into multiple hours and 7 unique areas? I don't follow. There are also many different explanations in your question (and here in the comment section). I think it would be beneficial for you to review them and choose one.
    – user3471881
    Nov 2 at 9:47








  • 2




    I'm also having a hard time understanding problem. I think it might help if you gave some context. What does each row of the data represent? What are the places, and who are the people? Is this about assigning staff to resources?
    – grge
    Nov 4 at 1:32








  • 2




    I don't doubt that I (and others) misunderstood your criteria, but maybe that is indicative of your description being unclear? I would take @TomWojcik:s advice and restructure your question. Focus on 1) input and 2) expected output. You now show multiple inputs (3, from my count) which confuses things even more. Try to show one input and one output that represent your problem.
    – user3471881
    Nov 5 at 9:04

















up vote
18
down vote

favorite
7












This question is related to rostering or staffing. I'm trying to assign various jobs to individuals (employees). Using the df below,



`[Person]` = Individuals (employees)
`[Area]` and `[Place]` = unique jobs
`[On]` = How many unique jobs are occurring at each point in time


So [Area] and [Place] together will make up unique values that are different jobs. These values will be assigned to individuals with the overall aim to use the least amount of individuals possible. The most unique values assigned to any one individual is 3. [On] displays how many current unique values for [Place] and [Area] are occurring. So this provides a concrete guide on how many individuals I need. For example,



1-3 unique values occurring = 1 individual
4-6 unique values occurring = 2 individuals
7-9 unique values occurring = 3 individuals etc


Question:
Where the amount of unique values in [Area] and [Place] is greater than 3 is causing me trouble. I can't do a groupby where I assign the first 3 unique values to individual 1 and the next 3 unique values to individual 2 etc. I want to group unique values in [Area] and [Place] by [Area]. So look to assign same values in [Area] to an individual (up to 3). Then, if there are leftover values (<3), they should be combined to make a group of 3, where possible.



The way I envisage this working is: see into the future by an hour. For each new row of values the script should see how many values will be [On](this provides an indication of how many total individuals are required). Where unique values are >3, they should be assigned by grouping the same value in [Area]. If there are leftover values they should be combined anyhow to make up to a group of 3.



Putting that into a step by step process:



1) Use the [On] Column to determine how many individuals are required by looking into the future for an hour



2) Where there are more than 3 unique values occurring assign the identical values in [Area] first.



3) If there are any leftover values then look to combine anyway possible.



For the df below, there are 9 unique values occurring for [Place] and [Area] with an hour. So we should have 3 individuals assigned. When unique values >3 it should be assigned by [Area] and seeing if the same value occurs. The leftover values should be combined with other individuals that have less than 3 unique values.



import pandas as pd
import numpy as np

d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
'Area' : ['A','B','C','D','E','D','E','F','G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
})

df = pd.DataFrame(data=d)


This is my attempt:



def reduce_df(df):
values = df['Area'] + df['Place']
df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
person_count = df1.groupby('Person')['Person'].agg('count')
leftover_count = person_count[person_count < 3] # the 'leftovers'

# try merging pairs together
nleft = leftover_count.shape[0]
to_try = np.arange(nleft - 1)
to_merge = (leftover_count.values[to_try] +
leftover_count.values[to_try + 1]) <= 3
to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
to_merge = to_try[to_merge]
merge_dict = dict(zip(leftover_count.index.values[to_merge+1],
leftover_count.index.values[to_merge]))
def change_person(p):
if p in merge_dict.keys():
return merge_dict[p]
return p
reduced_df = df.copy()
# update df with the merges you found
reduced_df['Person'] = reduced_df['Person'].apply(change_person)
return reduced_df

df1 = (reduce_df(reduce_df(df)))


This is the Output:



       Time    Place Area On    Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 4
4 8:35:00 House 5 E 5 Person 5
5 8:40:00 House 1 D 6 Person 4
6 8:42:00 House 2 E 7 Person 5
7 8:45:00 House 3 F 8 Person 5
8 8:50:00 House 2 G 9 Person 7


This is my Intended Output:



       Time    Place Area On    Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 5 E 5 Person 3
5 8:40:00 House 6 D 6 Person 2
6 8:42:00 House 2 E 7 Person 3
7 8:45:00 House 3 F 8 Person 2
8 8:50:00 House 2 G 9 Person 3


Description on how I want to get this output:



Index 0: One `unique` value occurring. So `assign` to individual 1
Index 1: Two `unique` values occurring. So `assign` to individual 1
Index 2: Three `unique` values occurring. So `assign` to individual 1
Index 3: Four `unique` values on. So `assign` to individual 2
Index 4: Five `unique` values on. This one is a bit tricky and hard to conceptualise. But there is another `E` within an `hour`. So `assign` to a new individual so it can be combined with the other `E`
Index 5: Six `unique` values on. Should be `assigned` with the other `D`. So individual 2
Index 6: Seven `unique` values on. Should be `assigned` with other `E`. So individual 3
Index 7: Eight `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Index 8: Nine `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3


Example No2:



d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
'Area' : ['X','X','X','X','X','X','X','X','X'],
'On' : ['1','2','3','3','3','3','3','3','3'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
})

df = pd.DataFrame(data=d)


I am getting an error:



 IndexError: index 1 is out of bounds for axis 1 with size 1


On this line:



df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]


However, if I change the Person to 1,2,3 repeating, it returns the following:



'Person' : ['Person 1','Person 2','Person 3','Person 1','Person 2','Person 3','Person 1','Person 2','Person 3'], 

Time Place Area On Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 2
4 8:35:00 House 2 X 3 Person 2
5 8:40:00 House 3 X 3 Person 2
6 8:42:00 House 1 X 3 Person 3
7 8:45:00 House 2 X 3 Person 3
8 8:50:00 House 3 X 3 Person 3


Intended Output:



      Time    Place Area On    Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 1
4 8:35:00 House 2 X 3 Person 1
5 8:40:00 House 3 X 3 Person 1
6 8:42:00 House 1 X 3 Person 1
7 8:45:00 House 2 X 3 Person 1
8 8:50:00 House 3 X 3 Person 1


The main takeaway from Example 2 is:



1) There are <3 unique values on so assign to individual 1









share|improve this question




















  • 4




    I am confused about the desired output, why is Person 4 not part of your desired output? I think the constraints of the problem aren't super clear
    – djakubosky
    Oct 30 at 1:45








  • 2




    @PeterJames123 Your output looks good to me. You have 3 individuals in total as you needed.(Person1, Person2 and Person4). Why you can't use this output? If order of the person is important you can check the order and see Person3 is missing then you can replace Person4 by Person3 and s on....
    – jimmy
    Oct 31 at 6:10






  • 8




    In your question you say that the values' uniqueness depend on Hour and Area. At the same time you say that your input d has 9 unique values "occuring within the hour" even though it is 9 values (observations) split into multiple hours and 7 unique areas? I don't follow. There are also many different explanations in your question (and here in the comment section). I think it would be beneficial for you to review them and choose one.
    – user3471881
    Nov 2 at 9:47








  • 2




    I'm also having a hard time understanding problem. I think it might help if you gave some context. What does each row of the data represent? What are the places, and who are the people? Is this about assigning staff to resources?
    – grge
    Nov 4 at 1:32








  • 2




    I don't doubt that I (and others) misunderstood your criteria, but maybe that is indicative of your description being unclear? I would take @TomWojcik:s advice and restructure your question. Focus on 1) input and 2) expected output. You now show multiple inputs (3, from my count) which confuses things even more. Try to show one input and one output that represent your problem.
    – user3471881
    Nov 5 at 9:04















up vote
18
down vote

favorite
7









up vote
18
down vote

favorite
7






7





This question is related to rostering or staffing. I'm trying to assign various jobs to individuals (employees). Using the df below,



`[Person]` = Individuals (employees)
`[Area]` and `[Place]` = unique jobs
`[On]` = How many unique jobs are occurring at each point in time


So [Area] and [Place] together will make up unique values that are different jobs. These values will be assigned to individuals with the overall aim to use the least amount of individuals possible. The most unique values assigned to any one individual is 3. [On] displays how many current unique values for [Place] and [Area] are occurring. So this provides a concrete guide on how many individuals I need. For example,



1-3 unique values occurring = 1 individual
4-6 unique values occurring = 2 individuals
7-9 unique values occurring = 3 individuals etc


Question:
Where the amount of unique values in [Area] and [Place] is greater than 3 is causing me trouble. I can't do a groupby where I assign the first 3 unique values to individual 1 and the next 3 unique values to individual 2 etc. I want to group unique values in [Area] and [Place] by [Area]. So look to assign same values in [Area] to an individual (up to 3). Then, if there are leftover values (<3), they should be combined to make a group of 3, where possible.



The way I envisage this working is: see into the future by an hour. For each new row of values the script should see how many values will be [On](this provides an indication of how many total individuals are required). Where unique values are >3, they should be assigned by grouping the same value in [Area]. If there are leftover values they should be combined anyhow to make up to a group of 3.



Putting that into a step by step process:



1) Use the [On] Column to determine how many individuals are required by looking into the future for an hour



2) Where there are more than 3 unique values occurring assign the identical values in [Area] first.



3) If there are any leftover values then look to combine anyway possible.



For the df below, there are 9 unique values occurring for [Place] and [Area] with an hour. So we should have 3 individuals assigned. When unique values >3 it should be assigned by [Area] and seeing if the same value occurs. The leftover values should be combined with other individuals that have less than 3 unique values.



import pandas as pd
import numpy as np

d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
'Area' : ['A','B','C','D','E','D','E','F','G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
})

df = pd.DataFrame(data=d)


This is my attempt:



def reduce_df(df):
values = df['Area'] + df['Place']
df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
person_count = df1.groupby('Person')['Person'].agg('count')
leftover_count = person_count[person_count < 3] # the 'leftovers'

# try merging pairs together
nleft = leftover_count.shape[0]
to_try = np.arange(nleft - 1)
to_merge = (leftover_count.values[to_try] +
leftover_count.values[to_try + 1]) <= 3
to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
to_merge = to_try[to_merge]
merge_dict = dict(zip(leftover_count.index.values[to_merge+1],
leftover_count.index.values[to_merge]))
def change_person(p):
if p in merge_dict.keys():
return merge_dict[p]
return p
reduced_df = df.copy()
# update df with the merges you found
reduced_df['Person'] = reduced_df['Person'].apply(change_person)
return reduced_df

df1 = (reduce_df(reduce_df(df)))


This is the Output:



       Time    Place Area On    Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 4
4 8:35:00 House 5 E 5 Person 5
5 8:40:00 House 1 D 6 Person 4
6 8:42:00 House 2 E 7 Person 5
7 8:45:00 House 3 F 8 Person 5
8 8:50:00 House 2 G 9 Person 7


This is my Intended Output:



       Time    Place Area On    Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 5 E 5 Person 3
5 8:40:00 House 6 D 6 Person 2
6 8:42:00 House 2 E 7 Person 3
7 8:45:00 House 3 F 8 Person 2
8 8:50:00 House 2 G 9 Person 3


Description on how I want to get this output:



Index 0: One `unique` value occurring. So `assign` to individual 1
Index 1: Two `unique` values occurring. So `assign` to individual 1
Index 2: Three `unique` values occurring. So `assign` to individual 1
Index 3: Four `unique` values on. So `assign` to individual 2
Index 4: Five `unique` values on. This one is a bit tricky and hard to conceptualise. But there is another `E` within an `hour`. So `assign` to a new individual so it can be combined with the other `E`
Index 5: Six `unique` values on. Should be `assigned` with the other `D`. So individual 2
Index 6: Seven `unique` values on. Should be `assigned` with other `E`. So individual 3
Index 7: Eight `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Index 8: Nine `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3


Example No2:



d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
'Area' : ['X','X','X','X','X','X','X','X','X'],
'On' : ['1','2','3','3','3','3','3','3','3'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
})

df = pd.DataFrame(data=d)


I am getting an error:



 IndexError: index 1 is out of bounds for axis 1 with size 1


On this line:



df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]


However, if I change the Person to 1,2,3 repeating, it returns the following:



'Person' : ['Person 1','Person 2','Person 3','Person 1','Person 2','Person 3','Person 1','Person 2','Person 3'], 

Time Place Area On Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 2
4 8:35:00 House 2 X 3 Person 2
5 8:40:00 House 3 X 3 Person 2
6 8:42:00 House 1 X 3 Person 3
7 8:45:00 House 2 X 3 Person 3
8 8:50:00 House 3 X 3 Person 3


Intended Output:



      Time    Place Area On    Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 1
4 8:35:00 House 2 X 3 Person 1
5 8:40:00 House 3 X 3 Person 1
6 8:42:00 House 1 X 3 Person 1
7 8:45:00 House 2 X 3 Person 1
8 8:50:00 House 3 X 3 Person 1


The main takeaway from Example 2 is:



1) There are <3 unique values on so assign to individual 1









share|improve this question















This question is related to rostering or staffing. I'm trying to assign various jobs to individuals (employees). Using the df below,



`[Person]` = Individuals (employees)
`[Area]` and `[Place]` = unique jobs
`[On]` = How many unique jobs are occurring at each point in time


So [Area] and [Place] together will make up unique values that are different jobs. These values will be assigned to individuals with the overall aim to use the least amount of individuals possible. The most unique values assigned to any one individual is 3. [On] displays how many current unique values for [Place] and [Area] are occurring. So this provides a concrete guide on how many individuals I need. For example,



1-3 unique values occurring = 1 individual
4-6 unique values occurring = 2 individuals
7-9 unique values occurring = 3 individuals etc


Question:
Where the amount of unique values in [Area] and [Place] is greater than 3 is causing me trouble. I can't do a groupby where I assign the first 3 unique values to individual 1 and the next 3 unique values to individual 2 etc. I want to group unique values in [Area] and [Place] by [Area]. So look to assign same values in [Area] to an individual (up to 3). Then, if there are leftover values (<3), they should be combined to make a group of 3, where possible.



The way I envisage this working is: see into the future by an hour. For each new row of values the script should see how many values will be [On](this provides an indication of how many total individuals are required). Where unique values are >3, they should be assigned by grouping the same value in [Area]. If there are leftover values they should be combined anyhow to make up to a group of 3.



Putting that into a step by step process:



1) Use the [On] Column to determine how many individuals are required by looking into the future for an hour



2) Where there are more than 3 unique values occurring assign the identical values in [Area] first.



3) If there are any leftover values then look to combine anyway possible.



For the df below, there are 9 unique values occurring for [Place] and [Area] with an hour. So we should have 3 individuals assigned. When unique values >3 it should be assigned by [Area] and seeing if the same value occurs. The leftover values should be combined with other individuals that have less than 3 unique values.



import pandas as pd
import numpy as np

d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
'Area' : ['A','B','C','D','E','D','E','F','G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
})

df = pd.DataFrame(data=d)


This is my attempt:



def reduce_df(df):
values = df['Area'] + df['Place']
df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
person_count = df1.groupby('Person')['Person'].agg('count')
leftover_count = person_count[person_count < 3] # the 'leftovers'

# try merging pairs together
nleft = leftover_count.shape[0]
to_try = np.arange(nleft - 1)
to_merge = (leftover_count.values[to_try] +
leftover_count.values[to_try + 1]) <= 3
to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
to_merge = to_try[to_merge]
merge_dict = dict(zip(leftover_count.index.values[to_merge+1],
leftover_count.index.values[to_merge]))
def change_person(p):
if p in merge_dict.keys():
return merge_dict[p]
return p
reduced_df = df.copy()
# update df with the merges you found
reduced_df['Person'] = reduced_df['Person'].apply(change_person)
return reduced_df

df1 = (reduce_df(reduce_df(df)))


This is the Output:



       Time    Place Area On    Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 4
4 8:35:00 House 5 E 5 Person 5
5 8:40:00 House 1 D 6 Person 4
6 8:42:00 House 2 E 7 Person 5
7 8:45:00 House 3 F 8 Person 5
8 8:50:00 House 2 G 9 Person 7


This is my Intended Output:



       Time    Place Area On    Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 5 E 5 Person 3
5 8:40:00 House 6 D 6 Person 2
6 8:42:00 House 2 E 7 Person 3
7 8:45:00 House 3 F 8 Person 2
8 8:50:00 House 2 G 9 Person 3


Description on how I want to get this output:



Index 0: One `unique` value occurring. So `assign` to individual 1
Index 1: Two `unique` values occurring. So `assign` to individual 1
Index 2: Three `unique` values occurring. So `assign` to individual 1
Index 3: Four `unique` values on. So `assign` to individual 2
Index 4: Five `unique` values on. This one is a bit tricky and hard to conceptualise. But there is another `E` within an `hour`. So `assign` to a new individual so it can be combined with the other `E`
Index 5: Six `unique` values on. Should be `assigned` with the other `D`. So individual 2
Index 6: Seven `unique` values on. Should be `assigned` with other `E`. So individual 3
Index 7: Eight `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Index 8: Nine `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3


Example No2:



d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
'Area' : ['X','X','X','X','X','X','X','X','X'],
'On' : ['1','2','3','3','3','3','3','3','3'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
})

df = pd.DataFrame(data=d)


I am getting an error:



 IndexError: index 1 is out of bounds for axis 1 with size 1


On this line:



df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]


However, if I change the Person to 1,2,3 repeating, it returns the following:



'Person' : ['Person 1','Person 2','Person 3','Person 1','Person 2','Person 3','Person 1','Person 2','Person 3'], 

Time Place Area On Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 2
4 8:35:00 House 2 X 3 Person 2
5 8:40:00 House 3 X 3 Person 2
6 8:42:00 House 1 X 3 Person 3
7 8:45:00 House 2 X 3 Person 3
8 8:50:00 House 3 X 3 Person 3


Intended Output:



      Time    Place Area On    Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 1
4 8:35:00 House 2 X 3 Person 1
5 8:40:00 House 3 X 3 Person 1
6 8:42:00 House 1 X 3 Person 1
7 8:45:00 House 2 X 3 Person 1
8 8:50:00 House 3 X 3 Person 1


The main takeaway from Example 2 is:



1) There are <3 unique values on so assign to individual 1






python pandas numpy dataframe assign






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 14 at 2:46

























asked Oct 10 at 0:04









PeterJames123

7113




7113








  • 4




    I am confused about the desired output, why is Person 4 not part of your desired output? I think the constraints of the problem aren't super clear
    – djakubosky
    Oct 30 at 1:45








  • 2




    @PeterJames123 Your output looks good to me. You have 3 individuals in total as you needed.(Person1, Person2 and Person4). Why you can't use this output? If order of the person is important you can check the order and see Person3 is missing then you can replace Person4 by Person3 and s on....
    – jimmy
    Oct 31 at 6:10






  • 8




    In your question you say that the values' uniqueness depend on Hour and Area. At the same time you say that your input d has 9 unique values "occuring within the hour" even though it is 9 values (observations) split into multiple hours and 7 unique areas? I don't follow. There are also many different explanations in your question (and here in the comment section). I think it would be beneficial for you to review them and choose one.
    – user3471881
    Nov 2 at 9:47








  • 2




    I'm also having a hard time understanding problem. I think it might help if you gave some context. What does each row of the data represent? What are the places, and who are the people? Is this about assigning staff to resources?
    – grge
    Nov 4 at 1:32








  • 2




    I don't doubt that I (and others) misunderstood your criteria, but maybe that is indicative of your description being unclear? I would take @TomWojcik:s advice and restructure your question. Focus on 1) input and 2) expected output. You now show multiple inputs (3, from my count) which confuses things even more. Try to show one input and one output that represent your problem.
    – user3471881
    Nov 5 at 9:04
















  • 4




    I am confused about the desired output, why is Person 4 not part of your desired output? I think the constraints of the problem aren't super clear
    – djakubosky
    Oct 30 at 1:45








  • 2




    @PeterJames123 Your output looks good to me. You have 3 individuals in total as you needed.(Person1, Person2 and Person4). Why you can't use this output? If order of the person is important you can check the order and see Person3 is missing then you can replace Person4 by Person3 and s on....
    – jimmy
    Oct 31 at 6:10






  • 8




    In your question you say that the values' uniqueness depend on Hour and Area. At the same time you say that your input d has 9 unique values "occuring within the hour" even though it is 9 values (observations) split into multiple hours and 7 unique areas? I don't follow. There are also many different explanations in your question (and here in the comment section). I think it would be beneficial for you to review them and choose one.
    – user3471881
    Nov 2 at 9:47








  • 2




    I'm also having a hard time understanding problem. I think it might help if you gave some context. What does each row of the data represent? What are the places, and who are the people? Is this about assigning staff to resources?
    – grge
    Nov 4 at 1:32








  • 2




    I don't doubt that I (and others) misunderstood your criteria, but maybe that is indicative of your description being unclear? I would take @TomWojcik:s advice and restructure your question. Focus on 1) input and 2) expected output. You now show multiple inputs (3, from my count) which confuses things even more. Try to show one input and one output that represent your problem.
    – user3471881
    Nov 5 at 9:04










4




4




I am confused about the desired output, why is Person 4 not part of your desired output? I think the constraints of the problem aren't super clear
– djakubosky
Oct 30 at 1:45






I am confused about the desired output, why is Person 4 not part of your desired output? I think the constraints of the problem aren't super clear
– djakubosky
Oct 30 at 1:45






2




2




@PeterJames123 Your output looks good to me. You have 3 individuals in total as you needed.(Person1, Person2 and Person4). Why you can't use this output? If order of the person is important you can check the order and see Person3 is missing then you can replace Person4 by Person3 and s on....
– jimmy
Oct 31 at 6:10




@PeterJames123 Your output looks good to me. You have 3 individuals in total as you needed.(Person1, Person2 and Person4). Why you can't use this output? If order of the person is important you can check the order and see Person3 is missing then you can replace Person4 by Person3 and s on....
– jimmy
Oct 31 at 6:10




8




8




In your question you say that the values' uniqueness depend on Hour and Area. At the same time you say that your input d has 9 unique values "occuring within the hour" even though it is 9 values (observations) split into multiple hours and 7 unique areas? I don't follow. There are also many different explanations in your question (and here in the comment section). I think it would be beneficial for you to review them and choose one.
– user3471881
Nov 2 at 9:47






In your question you say that the values' uniqueness depend on Hour and Area. At the same time you say that your input d has 9 unique values "occuring within the hour" even though it is 9 values (observations) split into multiple hours and 7 unique areas? I don't follow. There are also many different explanations in your question (and here in the comment section). I think it would be beneficial for you to review them and choose one.
– user3471881
Nov 2 at 9:47






2




2




I'm also having a hard time understanding problem. I think it might help if you gave some context. What does each row of the data represent? What are the places, and who are the people? Is this about assigning staff to resources?
– grge
Nov 4 at 1:32






I'm also having a hard time understanding problem. I think it might help if you gave some context. What does each row of the data represent? What are the places, and who are the people? Is this about assigning staff to resources?
– grge
Nov 4 at 1:32






2




2




I don't doubt that I (and others) misunderstood your criteria, but maybe that is indicative of your description being unclear? I would take @TomWojcik:s advice and restructure your question. Focus on 1) input and 2) expected output. You now show multiple inputs (3, from my count) which confuses things even more. Try to show one input and one output that represent your problem.
– user3471881
Nov 5 at 9:04






I don't doubt that I (and others) misunderstood your criteria, but maybe that is indicative of your description being unclear? I would take @TomWojcik:s advice and restructure your question. Focus on 1) input and 2) expected output. You now show multiple inputs (3, from my count) which confuses things even more. Try to show one input and one output that represent your problem.
– user3471881
Nov 5 at 9:04














3 Answers
3






active

oldest

votes

















up vote
5
down vote



accepted
+200










Update



There's a live version of this answer online that you can try for yourself.



Here's an answer in the form of the allocatePeople function. It's based around precomputing all of the indices where the areas repeat within an hour:



from collections import Counter
import numpy as np
import pandas as pd

def getAssignedPeople(df, areasPerPerson):
areas = df['Area'].values
places = df['Place'].values
times = pd.to_datetime(df['Time']).values
maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
assignmentCount = Counter()
assignedPeople =
assignedPlaces = {}
heldPeople = {}
heldAreas = {}
holdAvailable = True
person = 0

# search for repeated areas. Mark them if the next repeat occurs within an hour
ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
holds = np.zeros(areas.size, dtype=bool)
holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

for area,place,hold in zip(areas, places, holds):
if (area, place) in assignedPlaces:
# this unique (area, place) has already been assigned to someone
assignedPeople.append(assignedPlaces[(area, place)])
continue

if assignmentCount[person] >= areasPerPerson:
# the current person is already assigned to enough areas, move on to the next
a = heldPeople.pop(person, None)
heldAreas.pop(a, None)
person += 1

if area in heldAreas:
# assign to the person held in this area
p = heldAreas.pop(area)
heldPeople.pop(p)
else:
# get the first non-held person. If we need to hold in this area,
# also make sure the person has at least 2 free assignment slots,
# though if it's the last person assign to them anyway
p = person
while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
p += 1

assignmentCount.update([p])
assignedPlaces[(area, place)] = p
assignedPeople.append(p)

if hold:
if p==maxPerson:
# mark that there are no more people available to perform holds
holdAvailable = False

# this area recurrs in an hour, mark that the person should be held here
heldPeople[p] = area
heldAreas[area] = p

return assignedPeople

def allocatePeople(df, areasPerPerson=3):
assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
df = df.copy()
df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
return df


Note the use of df['Person'].unique() in allocatePeople. That handles the case where people are repeated in the input. It is assumed that the order of people in the input is the desired order in which those people should be assigned.



I tested allocatePeople against the OP's example input (example1 and example2) and also against a couple of edge cases I came up with that I think(?) match the OP's desired algorithm:



ds = dict(
example1 = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
'Area' : ['A','B','C','D','E','D','E','F','G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
}),
example2 = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
'Area' : ['X','X','X','X','X','X','X','X','X'],
'On' : ['1','2','3','3','3','3','3','3','3'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
}),

long_repeats = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:25:00','8:30:00','8:31:00','8:35:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 3','House 2'],
'Area' : ['A','A','A','A','B','C','C','C','B'],
'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 4','Person 4','Person 3'],
'On' : ['1','2','3','4','5','6','7','8','9'],
}),
many_repeats = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 2'],
'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
}),
large_gap = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 3'],
'Area' : ['A', 'B', 'C', 'D', 'E', 'F', 'D', 'D', 'D'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
}),
different_times = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','09:42:00','09:45:00','09:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 1'],
'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
})
)

expectedPeoples = dict(
example1 = [1,1,1,2,3,2,3,2,3],
example2 = [1,1,1,1,1,1,1,1,1],
long_repeats = [1,1,1,2,2,3,3,3,2],
many_repeats = [1,1,1,2,2,3,3,2,3],
large_gap = [1,1,1,2,3,3,2,2,3],
different_times = [1,1,1,2,2,2,3,3,3],
)

for name,d in ds.items():
df = pd.DataFrame(d)
expected = ['Person %d' % i for i in expectedPeoples[name]]
ap = allocatePeople(df)

print(name, ap, sep='n', end='nn')
np.testing.assert_array_equal(ap['Person'], expected)


The assert_array_equal statements pass, and the output matches OP's expected output:



example1
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 5 E 5 Person 3
5 08:40:00 House 1 D 6 Person 2
6 08:42:00 House 2 E 7 Person 3
7 08:45:00 House 3 F 8 Person 2
8 08:50:00 House 2 G 9 Person 3

example2
Time Place Area On Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 1
4 8:35:00 House 2 X 3 Person 1
5 8:40:00 House 3 X 3 Person 1
6 8:42:00 House 1 X 3 Person 1
7 8:45:00 House 2 X 3 Person 1
8 8:50:00 House 3 X 3 Person 1


The output for my test cases matches my expectations as well:



long_repeats
Time Place Area Person On
0 8:03:00 House 1 A Person 1 1
1 8:17:00 House 2 A Person 1 2
2 8:20:00 House 3 A Person 1 3
3 8:25:00 House 4 A Person 2 4
4 8:30:00 House 1 B Person 2 5
5 8:31:00 House 1 C Person 3 6
6 8:35:00 House 2 C Person 3 7
7 8:45:00 House 3 C Person 3 8
8 8:50:00 House 2 B Person 2 9

many_repeats
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 1 D 5 Person 2
5 08:40:00 House 1 E 6 Person 3
6 08:42:00 House 2 E 7 Person 3
7 08:45:00 House 1 F 8 Person 2
8 08:50:00 House 2 F 9 Person 3

large_gap
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 1 E 5 Person 3
5 08:40:00 House 1 F 6 Person 3
6 08:42:00 House 2 D 7 Person 2
7 08:45:00 House 1 D 8 Person 2
8 08:50:00 House 3 D 9 Person 3

different_times
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 1 D 5 Person 2
5 08:40:00 House 1 E 6 Person 2
6 09:42:00 House 2 E 7 Person 3
7 09:45:00 House 1 F 8 Person 3
8 09:50:00 House 1 G 9 Person 3


Let me know if it does everything you wanted, or if it still needs some tweaks. I think everyone is eager to see you fulfill your vision.






share|improve this answer























  • @PeterJames123 As a point of clarification, if the areas in the input were ['A', 'B', 'C', 'D', 'E', 'D', 'F', 'D', 'G'], should the order of the people in the output be [1, 1, 1, 2, 2, 2, 3, 3, 3] or [1, 1, 1, 2, 3, 2, 3, 2, 3]?
    – tel
    Nov 12 at 0:06












  • thank you for attempting this. I understand it's a difficult one. If they were within an hour of each other it would be [1,1,1,2,3,2,3,2,3]. The first 3 items in [Area] would be grouped first, all the D's would be grouped second, and the leftovers would be grouped third.
    – PeterJames123
    Nov 12 at 1:47










  • in regards to your code it works on a few iterations but it seems to be assigning too few individuals for the amount of unique values occurring. I'll attach another example at the bottom of the question. Thank you for trying this again.
    – PeterJames123
    Nov 12 at 1:48










  • @PeterJames123 I posted a second version with an algorithm that more closely matches your idea of looking ahead an hour to see if someone should stay where they are (see holdSieve in the getAssignedPeople function). Let me know if this one is good. If not, post more more example input/output data and we'll iterate again.
    – tel
    Nov 12 at 6:15










  • It's looking really close @tel! I'm liking the look of this. Just posted a new iteration. The total amount of individuals assigned is correct. Just need to refine the allocation of Area
    – PeterJames123
    Nov 12 at 7:17


















up vote
1
down vote













Ok, before we delve into the logic of the problem it is worthwhile to do some housekeeping to tidy-up the data and bring it into a more useful format:



#Create table of unique people
unique_people = df[['Person']].drop_duplicates().sort_values(['Person']).reset_index(drop=True)

#Reformat time column
df['Time'] = pd.to_datetime(df['Time'])


Now, getting to the logic of the problem, it is useful to break the problem down in to stages. Firstly, we will want to create individual jobs (with job numbers) based on the 'Area' and the time between them. i.e. jobs in the same area, within an hour can share the same job number.



#Assign jobs
df= df.sort_values(['Area','Time']).reset_index(drop=True)
df['Job no'] = 0
current_job = 1
df.loc[0,'Job no'] = current_job
for i in range(rows-1):
prev_row = df.loc[i]
row = df.loc[i+1]
time_diff = (row['Time'] - prev_row['Time']).seconds //3600
if (row['Area'] == prev_row['Area']) & (time_diff == 0):
pass
else:
current_job +=1
df.loc[i+1,'Job no'] = current_job


With this step now out of the way, it is a simple matter of assigning 'Persons' to individual jobs:



df= df.sort_values(['Job no']).reset_index(drop=True)
df['Person'] = ""
df_groups = df.groupby('Job no')
for group in df_groups:
group_size = group[1].count()['Time']
for person_idx in range(len(unique_people)):
person = unique_people.loc[person_idx]['Person']
person_count = df[df['Person']==person]['Person'].count()
if group_size <= (3-person_count):
idx = group[1].index.values
df.loc[idx,'Person'] = person
break


And finally,



df= df.sort_values(['Time']).reset_index(drop=True)
print(df)


I've attempted to code this in a way that is easier to unpick, so there may well be efficiencies to be made here. The aim however was to set out the logic used.



This code gives the expected results on both data sets, so I hope it answers your question.






share|improve this answer




























    up vote
    0
    down vote













    In writing my other answer, I slowly came around to the idea that the OP's algorithm might be easier to implement with an approach that focuses on the jobs (which can be different), instead of the people (which are all the same). Here's a solution that uses the job-centric approach:



    from collections import Counter
    import numpy as np
    import pandas as pd

    def assignJob(job, assignedix, areasPerPerson):
    for i in range(len(assignedix)):
    if (areasPerPerson - len(assignedix[i])) >= len(job):
    assignedix[i].extend(job)
    return True
    else:
    return False

    def allocatePeople(df, areasPerPerson=3):
    areas = df['Area'].values
    times = pd.to_datetime(df['Time']).values
    peopleUniq = df['Person'].unique()
    npeople = int(np.ceil(areas.size / float(areasPerPerson)))

    # search for repeated areas. Mark them if the next repeat occurs within an hour
    ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
    holds = np.zeros(areas.size, dtype=bool)
    holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

    jobs =
    _jobdict = {}
    for i,(area,hold) in enumerate(zip(areas, holds)):
    if hold:
    _jobdict[area] = job = _jobdict.get(area, ) + [i]
    if len(job)==areasPerPerson:
    jobs.append(_jobdict.pop(area))
    elif area in _jobdict:
    jobs.append(_jobdict.pop(area) + [i])
    else:
    jobs.append([i])
    jobs.sort()

    assignedix = [ for i in range(npeople)]
    for job in jobs:
    if not assignJob(job, assignedix, areasPerPerson):
    # break the job up and try again
    for subjob in ([sj] for sj in job):
    assignJob(subjob, assignedix, areasPerPerson)

    df = df.copy()
    for i,aix in enumerate(assignedix):
    df.loc[aix, 'Person'] = peopleUniq[i]
    return df


    This version of allocatePeople has also been extensively tested and passes all of the same checks described in my other answer.



    It does have more looping than my other solution, so it is likely to be slightly less efficient (though it'll only matter if your dataframe is very large, say 1e6 rows and up). On the other hand, it is somewhat shorter and, I think, more straightforward and easy to understand.






    share|improve this answer





















      Your Answer






      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "1"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52730864%2fre-assign-column-values-in-a-pandas-df%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      5
      down vote



      accepted
      +200










      Update



      There's a live version of this answer online that you can try for yourself.



      Here's an answer in the form of the allocatePeople function. It's based around precomputing all of the indices where the areas repeat within an hour:



      from collections import Counter
      import numpy as np
      import pandas as pd

      def getAssignedPeople(df, areasPerPerson):
      areas = df['Area'].values
      places = df['Place'].values
      times = pd.to_datetime(df['Time']).values
      maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
      assignmentCount = Counter()
      assignedPeople =
      assignedPlaces = {}
      heldPeople = {}
      heldAreas = {}
      holdAvailable = True
      person = 0

      # search for repeated areas. Mark them if the next repeat occurs within an hour
      ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
      holds = np.zeros(areas.size, dtype=bool)
      holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

      for area,place,hold in zip(areas, places, holds):
      if (area, place) in assignedPlaces:
      # this unique (area, place) has already been assigned to someone
      assignedPeople.append(assignedPlaces[(area, place)])
      continue

      if assignmentCount[person] >= areasPerPerson:
      # the current person is already assigned to enough areas, move on to the next
      a = heldPeople.pop(person, None)
      heldAreas.pop(a, None)
      person += 1

      if area in heldAreas:
      # assign to the person held in this area
      p = heldAreas.pop(area)
      heldPeople.pop(p)
      else:
      # get the first non-held person. If we need to hold in this area,
      # also make sure the person has at least 2 free assignment slots,
      # though if it's the last person assign to them anyway
      p = person
      while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
      p += 1

      assignmentCount.update([p])
      assignedPlaces[(area, place)] = p
      assignedPeople.append(p)

      if hold:
      if p==maxPerson:
      # mark that there are no more people available to perform holds
      holdAvailable = False

      # this area recurrs in an hour, mark that the person should be held here
      heldPeople[p] = area
      heldAreas[area] = p

      return assignedPeople

      def allocatePeople(df, areasPerPerson=3):
      assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
      df = df.copy()
      df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
      return df


      Note the use of df['Person'].unique() in allocatePeople. That handles the case where people are repeated in the input. It is assumed that the order of people in the input is the desired order in which those people should be assigned.



      I tested allocatePeople against the OP's example input (example1 and example2) and also against a couple of edge cases I came up with that I think(?) match the OP's desired algorithm:



      ds = dict(
      example1 = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
      'Area' : ['A','B','C','D','E','D','E','F','G'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
      }),
      example2 = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
      'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
      'Area' : ['X','X','X','X','X','X','X','X','X'],
      'On' : ['1','2','3','3','3','3','3','3','3'],
      'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
      }),

      long_repeats = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:25:00','8:30:00','8:31:00','8:35:00','8:45:00','8:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 3','House 2'],
      'Area' : ['A','A','A','A','B','C','C','C','B'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 4','Person 4','Person 3'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      }),
      many_repeats = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 2'],
      'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      }),
      large_gap = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 3'],
      'Area' : ['A', 'B', 'C', 'D', 'E', 'F', 'D', 'D', 'D'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      }),
      different_times = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','09:42:00','09:45:00','09:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 1'],
      'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'G'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      })
      )

      expectedPeoples = dict(
      example1 = [1,1,1,2,3,2,3,2,3],
      example2 = [1,1,1,1,1,1,1,1,1],
      long_repeats = [1,1,1,2,2,3,3,3,2],
      many_repeats = [1,1,1,2,2,3,3,2,3],
      large_gap = [1,1,1,2,3,3,2,2,3],
      different_times = [1,1,1,2,2,2,3,3,3],
      )

      for name,d in ds.items():
      df = pd.DataFrame(d)
      expected = ['Person %d' % i for i in expectedPeoples[name]]
      ap = allocatePeople(df)

      print(name, ap, sep='n', end='nn')
      np.testing.assert_array_equal(ap['Person'], expected)


      The assert_array_equal statements pass, and the output matches OP's expected output:



      example1
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 5 E 5 Person 3
      5 08:40:00 House 1 D 6 Person 2
      6 08:42:00 House 2 E 7 Person 3
      7 08:45:00 House 3 F 8 Person 2
      8 08:50:00 House 2 G 9 Person 3

      example2
      Time Place Area On Person
      0 8:03:00 House 1 X 1 Person 1
      1 8:17:00 House 2 X 2 Person 1
      2 8:20:00 House 3 X 3 Person 1
      3 8:28:00 House 1 X 3 Person 1
      4 8:35:00 House 2 X 3 Person 1
      5 8:40:00 House 3 X 3 Person 1
      6 8:42:00 House 1 X 3 Person 1
      7 8:45:00 House 2 X 3 Person 1
      8 8:50:00 House 3 X 3 Person 1


      The output for my test cases matches my expectations as well:



      long_repeats
      Time Place Area Person On
      0 8:03:00 House 1 A Person 1 1
      1 8:17:00 House 2 A Person 1 2
      2 8:20:00 House 3 A Person 1 3
      3 8:25:00 House 4 A Person 2 4
      4 8:30:00 House 1 B Person 2 5
      5 8:31:00 House 1 C Person 3 6
      6 8:35:00 House 2 C Person 3 7
      7 8:45:00 House 3 C Person 3 8
      8 8:50:00 House 2 B Person 2 9

      many_repeats
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 D 5 Person 2
      5 08:40:00 House 1 E 6 Person 3
      6 08:42:00 House 2 E 7 Person 3
      7 08:45:00 House 1 F 8 Person 2
      8 08:50:00 House 2 F 9 Person 3

      large_gap
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 E 5 Person 3
      5 08:40:00 House 1 F 6 Person 3
      6 08:42:00 House 2 D 7 Person 2
      7 08:45:00 House 1 D 8 Person 2
      8 08:50:00 House 3 D 9 Person 3

      different_times
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 D 5 Person 2
      5 08:40:00 House 1 E 6 Person 2
      6 09:42:00 House 2 E 7 Person 3
      7 09:45:00 House 1 F 8 Person 3
      8 09:50:00 House 1 G 9 Person 3


      Let me know if it does everything you wanted, or if it still needs some tweaks. I think everyone is eager to see you fulfill your vision.






      share|improve this answer























      • @PeterJames123 As a point of clarification, if the areas in the input were ['A', 'B', 'C', 'D', 'E', 'D', 'F', 'D', 'G'], should the order of the people in the output be [1, 1, 1, 2, 2, 2, 3, 3, 3] or [1, 1, 1, 2, 3, 2, 3, 2, 3]?
        – tel
        Nov 12 at 0:06












      • thank you for attempting this. I understand it's a difficult one. If they were within an hour of each other it would be [1,1,1,2,3,2,3,2,3]. The first 3 items in [Area] would be grouped first, all the D's would be grouped second, and the leftovers would be grouped third.
        – PeterJames123
        Nov 12 at 1:47










      • in regards to your code it works on a few iterations but it seems to be assigning too few individuals for the amount of unique values occurring. I'll attach another example at the bottom of the question. Thank you for trying this again.
        – PeterJames123
        Nov 12 at 1:48










      • @PeterJames123 I posted a second version with an algorithm that more closely matches your idea of looking ahead an hour to see if someone should stay where they are (see holdSieve in the getAssignedPeople function). Let me know if this one is good. If not, post more more example input/output data and we'll iterate again.
        – tel
        Nov 12 at 6:15










      • It's looking really close @tel! I'm liking the look of this. Just posted a new iteration. The total amount of individuals assigned is correct. Just need to refine the allocation of Area
        – PeterJames123
        Nov 12 at 7:17















      up vote
      5
      down vote



      accepted
      +200










      Update



      There's a live version of this answer online that you can try for yourself.



      Here's an answer in the form of the allocatePeople function. It's based around precomputing all of the indices where the areas repeat within an hour:



      from collections import Counter
      import numpy as np
      import pandas as pd

      def getAssignedPeople(df, areasPerPerson):
      areas = df['Area'].values
      places = df['Place'].values
      times = pd.to_datetime(df['Time']).values
      maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
      assignmentCount = Counter()
      assignedPeople =
      assignedPlaces = {}
      heldPeople = {}
      heldAreas = {}
      holdAvailable = True
      person = 0

      # search for repeated areas. Mark them if the next repeat occurs within an hour
      ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
      holds = np.zeros(areas.size, dtype=bool)
      holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

      for area,place,hold in zip(areas, places, holds):
      if (area, place) in assignedPlaces:
      # this unique (area, place) has already been assigned to someone
      assignedPeople.append(assignedPlaces[(area, place)])
      continue

      if assignmentCount[person] >= areasPerPerson:
      # the current person is already assigned to enough areas, move on to the next
      a = heldPeople.pop(person, None)
      heldAreas.pop(a, None)
      person += 1

      if area in heldAreas:
      # assign to the person held in this area
      p = heldAreas.pop(area)
      heldPeople.pop(p)
      else:
      # get the first non-held person. If we need to hold in this area,
      # also make sure the person has at least 2 free assignment slots,
      # though if it's the last person assign to them anyway
      p = person
      while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
      p += 1

      assignmentCount.update([p])
      assignedPlaces[(area, place)] = p
      assignedPeople.append(p)

      if hold:
      if p==maxPerson:
      # mark that there are no more people available to perform holds
      holdAvailable = False

      # this area recurrs in an hour, mark that the person should be held here
      heldPeople[p] = area
      heldAreas[area] = p

      return assignedPeople

      def allocatePeople(df, areasPerPerson=3):
      assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
      df = df.copy()
      df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
      return df


      Note the use of df['Person'].unique() in allocatePeople. That handles the case where people are repeated in the input. It is assumed that the order of people in the input is the desired order in which those people should be assigned.



      I tested allocatePeople against the OP's example input (example1 and example2) and also against a couple of edge cases I came up with that I think(?) match the OP's desired algorithm:



      ds = dict(
      example1 = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
      'Area' : ['A','B','C','D','E','D','E','F','G'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
      }),
      example2 = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
      'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
      'Area' : ['X','X','X','X','X','X','X','X','X'],
      'On' : ['1','2','3','3','3','3','3','3','3'],
      'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
      }),

      long_repeats = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:25:00','8:30:00','8:31:00','8:35:00','8:45:00','8:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 3','House 2'],
      'Area' : ['A','A','A','A','B','C','C','C','B'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 4','Person 4','Person 3'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      }),
      many_repeats = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 2'],
      'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      }),
      large_gap = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 3'],
      'Area' : ['A', 'B', 'C', 'D', 'E', 'F', 'D', 'D', 'D'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      }),
      different_times = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','09:42:00','09:45:00','09:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 1'],
      'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'G'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      })
      )

      expectedPeoples = dict(
      example1 = [1,1,1,2,3,2,3,2,3],
      example2 = [1,1,1,1,1,1,1,1,1],
      long_repeats = [1,1,1,2,2,3,3,3,2],
      many_repeats = [1,1,1,2,2,3,3,2,3],
      large_gap = [1,1,1,2,3,3,2,2,3],
      different_times = [1,1,1,2,2,2,3,3,3],
      )

      for name,d in ds.items():
      df = pd.DataFrame(d)
      expected = ['Person %d' % i for i in expectedPeoples[name]]
      ap = allocatePeople(df)

      print(name, ap, sep='n', end='nn')
      np.testing.assert_array_equal(ap['Person'], expected)


      The assert_array_equal statements pass, and the output matches OP's expected output:



      example1
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 5 E 5 Person 3
      5 08:40:00 House 1 D 6 Person 2
      6 08:42:00 House 2 E 7 Person 3
      7 08:45:00 House 3 F 8 Person 2
      8 08:50:00 House 2 G 9 Person 3

      example2
      Time Place Area On Person
      0 8:03:00 House 1 X 1 Person 1
      1 8:17:00 House 2 X 2 Person 1
      2 8:20:00 House 3 X 3 Person 1
      3 8:28:00 House 1 X 3 Person 1
      4 8:35:00 House 2 X 3 Person 1
      5 8:40:00 House 3 X 3 Person 1
      6 8:42:00 House 1 X 3 Person 1
      7 8:45:00 House 2 X 3 Person 1
      8 8:50:00 House 3 X 3 Person 1


      The output for my test cases matches my expectations as well:



      long_repeats
      Time Place Area Person On
      0 8:03:00 House 1 A Person 1 1
      1 8:17:00 House 2 A Person 1 2
      2 8:20:00 House 3 A Person 1 3
      3 8:25:00 House 4 A Person 2 4
      4 8:30:00 House 1 B Person 2 5
      5 8:31:00 House 1 C Person 3 6
      6 8:35:00 House 2 C Person 3 7
      7 8:45:00 House 3 C Person 3 8
      8 8:50:00 House 2 B Person 2 9

      many_repeats
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 D 5 Person 2
      5 08:40:00 House 1 E 6 Person 3
      6 08:42:00 House 2 E 7 Person 3
      7 08:45:00 House 1 F 8 Person 2
      8 08:50:00 House 2 F 9 Person 3

      large_gap
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 E 5 Person 3
      5 08:40:00 House 1 F 6 Person 3
      6 08:42:00 House 2 D 7 Person 2
      7 08:45:00 House 1 D 8 Person 2
      8 08:50:00 House 3 D 9 Person 3

      different_times
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 D 5 Person 2
      5 08:40:00 House 1 E 6 Person 2
      6 09:42:00 House 2 E 7 Person 3
      7 09:45:00 House 1 F 8 Person 3
      8 09:50:00 House 1 G 9 Person 3


      Let me know if it does everything you wanted, or if it still needs some tweaks. I think everyone is eager to see you fulfill your vision.






      share|improve this answer























      • @PeterJames123 As a point of clarification, if the areas in the input were ['A', 'B', 'C', 'D', 'E', 'D', 'F', 'D', 'G'], should the order of the people in the output be [1, 1, 1, 2, 2, 2, 3, 3, 3] or [1, 1, 1, 2, 3, 2, 3, 2, 3]?
        – tel
        Nov 12 at 0:06












      • thank you for attempting this. I understand it's a difficult one. If they were within an hour of each other it would be [1,1,1,2,3,2,3,2,3]. The first 3 items in [Area] would be grouped first, all the D's would be grouped second, and the leftovers would be grouped third.
        – PeterJames123
        Nov 12 at 1:47










      • in regards to your code it works on a few iterations but it seems to be assigning too few individuals for the amount of unique values occurring. I'll attach another example at the bottom of the question. Thank you for trying this again.
        – PeterJames123
        Nov 12 at 1:48










      • @PeterJames123 I posted a second version with an algorithm that more closely matches your idea of looking ahead an hour to see if someone should stay where they are (see holdSieve in the getAssignedPeople function). Let me know if this one is good. If not, post more more example input/output data and we'll iterate again.
        – tel
        Nov 12 at 6:15










      • It's looking really close @tel! I'm liking the look of this. Just posted a new iteration. The total amount of individuals assigned is correct. Just need to refine the allocation of Area
        – PeterJames123
        Nov 12 at 7:17













      up vote
      5
      down vote



      accepted
      +200







      up vote
      5
      down vote



      accepted
      +200




      +200




      Update



      There's a live version of this answer online that you can try for yourself.



      Here's an answer in the form of the allocatePeople function. It's based around precomputing all of the indices where the areas repeat within an hour:



      from collections import Counter
      import numpy as np
      import pandas as pd

      def getAssignedPeople(df, areasPerPerson):
      areas = df['Area'].values
      places = df['Place'].values
      times = pd.to_datetime(df['Time']).values
      maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
      assignmentCount = Counter()
      assignedPeople =
      assignedPlaces = {}
      heldPeople = {}
      heldAreas = {}
      holdAvailable = True
      person = 0

      # search for repeated areas. Mark them if the next repeat occurs within an hour
      ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
      holds = np.zeros(areas.size, dtype=bool)
      holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

      for area,place,hold in zip(areas, places, holds):
      if (area, place) in assignedPlaces:
      # this unique (area, place) has already been assigned to someone
      assignedPeople.append(assignedPlaces[(area, place)])
      continue

      if assignmentCount[person] >= areasPerPerson:
      # the current person is already assigned to enough areas, move on to the next
      a = heldPeople.pop(person, None)
      heldAreas.pop(a, None)
      person += 1

      if area in heldAreas:
      # assign to the person held in this area
      p = heldAreas.pop(area)
      heldPeople.pop(p)
      else:
      # get the first non-held person. If we need to hold in this area,
      # also make sure the person has at least 2 free assignment slots,
      # though if it's the last person assign to them anyway
      p = person
      while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
      p += 1

      assignmentCount.update([p])
      assignedPlaces[(area, place)] = p
      assignedPeople.append(p)

      if hold:
      if p==maxPerson:
      # mark that there are no more people available to perform holds
      holdAvailable = False

      # this area recurrs in an hour, mark that the person should be held here
      heldPeople[p] = area
      heldAreas[area] = p

      return assignedPeople

      def allocatePeople(df, areasPerPerson=3):
      assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
      df = df.copy()
      df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
      return df


      Note the use of df['Person'].unique() in allocatePeople. That handles the case where people are repeated in the input. It is assumed that the order of people in the input is the desired order in which those people should be assigned.



      I tested allocatePeople against the OP's example input (example1 and example2) and also against a couple of edge cases I came up with that I think(?) match the OP's desired algorithm:



      ds = dict(
      example1 = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
      'Area' : ['A','B','C','D','E','D','E','F','G'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
      }),
      example2 = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
      'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
      'Area' : ['X','X','X','X','X','X','X','X','X'],
      'On' : ['1','2','3','3','3','3','3','3','3'],
      'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
      }),

      long_repeats = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:25:00','8:30:00','8:31:00','8:35:00','8:45:00','8:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 3','House 2'],
      'Area' : ['A','A','A','A','B','C','C','C','B'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 4','Person 4','Person 3'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      }),
      many_repeats = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 2'],
      'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      }),
      large_gap = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 3'],
      'Area' : ['A', 'B', 'C', 'D', 'E', 'F', 'D', 'D', 'D'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      }),
      different_times = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','09:42:00','09:45:00','09:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 1'],
      'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'G'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      })
      )

      expectedPeoples = dict(
      example1 = [1,1,1,2,3,2,3,2,3],
      example2 = [1,1,1,1,1,1,1,1,1],
      long_repeats = [1,1,1,2,2,3,3,3,2],
      many_repeats = [1,1,1,2,2,3,3,2,3],
      large_gap = [1,1,1,2,3,3,2,2,3],
      different_times = [1,1,1,2,2,2,3,3,3],
      )

      for name,d in ds.items():
      df = pd.DataFrame(d)
      expected = ['Person %d' % i for i in expectedPeoples[name]]
      ap = allocatePeople(df)

      print(name, ap, sep='n', end='nn')
      np.testing.assert_array_equal(ap['Person'], expected)


      The assert_array_equal statements pass, and the output matches OP's expected output:



      example1
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 5 E 5 Person 3
      5 08:40:00 House 1 D 6 Person 2
      6 08:42:00 House 2 E 7 Person 3
      7 08:45:00 House 3 F 8 Person 2
      8 08:50:00 House 2 G 9 Person 3

      example2
      Time Place Area On Person
      0 8:03:00 House 1 X 1 Person 1
      1 8:17:00 House 2 X 2 Person 1
      2 8:20:00 House 3 X 3 Person 1
      3 8:28:00 House 1 X 3 Person 1
      4 8:35:00 House 2 X 3 Person 1
      5 8:40:00 House 3 X 3 Person 1
      6 8:42:00 House 1 X 3 Person 1
      7 8:45:00 House 2 X 3 Person 1
      8 8:50:00 House 3 X 3 Person 1


      The output for my test cases matches my expectations as well:



      long_repeats
      Time Place Area Person On
      0 8:03:00 House 1 A Person 1 1
      1 8:17:00 House 2 A Person 1 2
      2 8:20:00 House 3 A Person 1 3
      3 8:25:00 House 4 A Person 2 4
      4 8:30:00 House 1 B Person 2 5
      5 8:31:00 House 1 C Person 3 6
      6 8:35:00 House 2 C Person 3 7
      7 8:45:00 House 3 C Person 3 8
      8 8:50:00 House 2 B Person 2 9

      many_repeats
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 D 5 Person 2
      5 08:40:00 House 1 E 6 Person 3
      6 08:42:00 House 2 E 7 Person 3
      7 08:45:00 House 1 F 8 Person 2
      8 08:50:00 House 2 F 9 Person 3

      large_gap
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 E 5 Person 3
      5 08:40:00 House 1 F 6 Person 3
      6 08:42:00 House 2 D 7 Person 2
      7 08:45:00 House 1 D 8 Person 2
      8 08:50:00 House 3 D 9 Person 3

      different_times
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 D 5 Person 2
      5 08:40:00 House 1 E 6 Person 2
      6 09:42:00 House 2 E 7 Person 3
      7 09:45:00 House 1 F 8 Person 3
      8 09:50:00 House 1 G 9 Person 3


      Let me know if it does everything you wanted, or if it still needs some tweaks. I think everyone is eager to see you fulfill your vision.






      share|improve this answer














      Update



      There's a live version of this answer online that you can try for yourself.



      Here's an answer in the form of the allocatePeople function. It's based around precomputing all of the indices where the areas repeat within an hour:



      from collections import Counter
      import numpy as np
      import pandas as pd

      def getAssignedPeople(df, areasPerPerson):
      areas = df['Area'].values
      places = df['Place'].values
      times = pd.to_datetime(df['Time']).values
      maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
      assignmentCount = Counter()
      assignedPeople =
      assignedPlaces = {}
      heldPeople = {}
      heldAreas = {}
      holdAvailable = True
      person = 0

      # search for repeated areas. Mark them if the next repeat occurs within an hour
      ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
      holds = np.zeros(areas.size, dtype=bool)
      holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

      for area,place,hold in zip(areas, places, holds):
      if (area, place) in assignedPlaces:
      # this unique (area, place) has already been assigned to someone
      assignedPeople.append(assignedPlaces[(area, place)])
      continue

      if assignmentCount[person] >= areasPerPerson:
      # the current person is already assigned to enough areas, move on to the next
      a = heldPeople.pop(person, None)
      heldAreas.pop(a, None)
      person += 1

      if area in heldAreas:
      # assign to the person held in this area
      p = heldAreas.pop(area)
      heldPeople.pop(p)
      else:
      # get the first non-held person. If we need to hold in this area,
      # also make sure the person has at least 2 free assignment slots,
      # though if it's the last person assign to them anyway
      p = person
      while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
      p += 1

      assignmentCount.update([p])
      assignedPlaces[(area, place)] = p
      assignedPeople.append(p)

      if hold:
      if p==maxPerson:
      # mark that there are no more people available to perform holds
      holdAvailable = False

      # this area recurrs in an hour, mark that the person should be held here
      heldPeople[p] = area
      heldAreas[area] = p

      return assignedPeople

      def allocatePeople(df, areasPerPerson=3):
      assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
      df = df.copy()
      df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
      return df


      Note the use of df['Person'].unique() in allocatePeople. That handles the case where people are repeated in the input. It is assumed that the order of people in the input is the desired order in which those people should be assigned.



      I tested allocatePeople against the OP's example input (example1 and example2) and also against a couple of edge cases I came up with that I think(?) match the OP's desired algorithm:



      ds = dict(
      example1 = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
      'Area' : ['A','B','C','D','E','D','E','F','G'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
      }),
      example2 = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
      'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
      'Area' : ['X','X','X','X','X','X','X','X','X'],
      'On' : ['1','2','3','3','3','3','3','3','3'],
      'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
      }),

      long_repeats = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:25:00','8:30:00','8:31:00','8:35:00','8:45:00','8:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 3','House 2'],
      'Area' : ['A','A','A','A','B','C','C','C','B'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 4','Person 4','Person 3'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      }),
      many_repeats = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 2'],
      'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      }),
      large_gap = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 3'],
      'Area' : ['A', 'B', 'C', 'D', 'E', 'F', 'D', 'D', 'D'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      }),
      different_times = ({
      'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','09:42:00','09:45:00','09:50:00'],
      'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 1'],
      'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'G'],
      'On' : ['1','2','3','4','5','6','7','8','9'],
      'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
      })
      )

      expectedPeoples = dict(
      example1 = [1,1,1,2,3,2,3,2,3],
      example2 = [1,1,1,1,1,1,1,1,1],
      long_repeats = [1,1,1,2,2,3,3,3,2],
      many_repeats = [1,1,1,2,2,3,3,2,3],
      large_gap = [1,1,1,2,3,3,2,2,3],
      different_times = [1,1,1,2,2,2,3,3,3],
      )

      for name,d in ds.items():
      df = pd.DataFrame(d)
      expected = ['Person %d' % i for i in expectedPeoples[name]]
      ap = allocatePeople(df)

      print(name, ap, sep='n', end='nn')
      np.testing.assert_array_equal(ap['Person'], expected)


      The assert_array_equal statements pass, and the output matches OP's expected output:



      example1
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 5 E 5 Person 3
      5 08:40:00 House 1 D 6 Person 2
      6 08:42:00 House 2 E 7 Person 3
      7 08:45:00 House 3 F 8 Person 2
      8 08:50:00 House 2 G 9 Person 3

      example2
      Time Place Area On Person
      0 8:03:00 House 1 X 1 Person 1
      1 8:17:00 House 2 X 2 Person 1
      2 8:20:00 House 3 X 3 Person 1
      3 8:28:00 House 1 X 3 Person 1
      4 8:35:00 House 2 X 3 Person 1
      5 8:40:00 House 3 X 3 Person 1
      6 8:42:00 House 1 X 3 Person 1
      7 8:45:00 House 2 X 3 Person 1
      8 8:50:00 House 3 X 3 Person 1


      The output for my test cases matches my expectations as well:



      long_repeats
      Time Place Area Person On
      0 8:03:00 House 1 A Person 1 1
      1 8:17:00 House 2 A Person 1 2
      2 8:20:00 House 3 A Person 1 3
      3 8:25:00 House 4 A Person 2 4
      4 8:30:00 House 1 B Person 2 5
      5 8:31:00 House 1 C Person 3 6
      6 8:35:00 House 2 C Person 3 7
      7 8:45:00 House 3 C Person 3 8
      8 8:50:00 House 2 B Person 2 9

      many_repeats
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 D 5 Person 2
      5 08:40:00 House 1 E 6 Person 3
      6 08:42:00 House 2 E 7 Person 3
      7 08:45:00 House 1 F 8 Person 2
      8 08:50:00 House 2 F 9 Person 3

      large_gap
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 E 5 Person 3
      5 08:40:00 House 1 F 6 Person 3
      6 08:42:00 House 2 D 7 Person 2
      7 08:45:00 House 1 D 8 Person 2
      8 08:50:00 House 3 D 9 Person 3

      different_times
      Time Place Area On Person
      0 8:03:00 House 1 A 1 Person 1
      1 8:17:00 House 2 B 2 Person 1
      2 8:20:00 House 3 C 3 Person 1
      3 8:28:00 House 4 D 4 Person 2
      4 8:35:00 House 1 D 5 Person 2
      5 08:40:00 House 1 E 6 Person 2
      6 09:42:00 House 2 E 7 Person 3
      7 09:45:00 House 1 F 8 Person 3
      8 09:50:00 House 1 G 9 Person 3


      Let me know if it does everything you wanted, or if it still needs some tweaks. I think everyone is eager to see you fulfill your vision.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 14 at 9:13

























      answered Nov 11 at 8:19









      tel

      3,5511427




      3,5511427












      • @PeterJames123 As a point of clarification, if the areas in the input were ['A', 'B', 'C', 'D', 'E', 'D', 'F', 'D', 'G'], should the order of the people in the output be [1, 1, 1, 2, 2, 2, 3, 3, 3] or [1, 1, 1, 2, 3, 2, 3, 2, 3]?
        – tel
        Nov 12 at 0:06












      • thank you for attempting this. I understand it's a difficult one. If they were within an hour of each other it would be [1,1,1,2,3,2,3,2,3]. The first 3 items in [Area] would be grouped first, all the D's would be grouped second, and the leftovers would be grouped third.
        – PeterJames123
        Nov 12 at 1:47










      • in regards to your code it works on a few iterations but it seems to be assigning too few individuals for the amount of unique values occurring. I'll attach another example at the bottom of the question. Thank you for trying this again.
        – PeterJames123
        Nov 12 at 1:48










      • @PeterJames123 I posted a second version with an algorithm that more closely matches your idea of looking ahead an hour to see if someone should stay where they are (see holdSieve in the getAssignedPeople function). Let me know if this one is good. If not, post more more example input/output data and we'll iterate again.
        – tel
        Nov 12 at 6:15










      • It's looking really close @tel! I'm liking the look of this. Just posted a new iteration. The total amount of individuals assigned is correct. Just need to refine the allocation of Area
        – PeterJames123
        Nov 12 at 7:17


















      • @PeterJames123 As a point of clarification, if the areas in the input were ['A', 'B', 'C', 'D', 'E', 'D', 'F', 'D', 'G'], should the order of the people in the output be [1, 1, 1, 2, 2, 2, 3, 3, 3] or [1, 1, 1, 2, 3, 2, 3, 2, 3]?
        – tel
        Nov 12 at 0:06












      • thank you for attempting this. I understand it's a difficult one. If they were within an hour of each other it would be [1,1,1,2,3,2,3,2,3]. The first 3 items in [Area] would be grouped first, all the D's would be grouped second, and the leftovers would be grouped third.
        – PeterJames123
        Nov 12 at 1:47










      • in regards to your code it works on a few iterations but it seems to be assigning too few individuals for the amount of unique values occurring. I'll attach another example at the bottom of the question. Thank you for trying this again.
        – PeterJames123
        Nov 12 at 1:48










      • @PeterJames123 I posted a second version with an algorithm that more closely matches your idea of looking ahead an hour to see if someone should stay where they are (see holdSieve in the getAssignedPeople function). Let me know if this one is good. If not, post more more example input/output data and we'll iterate again.
        – tel
        Nov 12 at 6:15










      • It's looking really close @tel! I'm liking the look of this. Just posted a new iteration. The total amount of individuals assigned is correct. Just need to refine the allocation of Area
        – PeterJames123
        Nov 12 at 7:17
















      @PeterJames123 As a point of clarification, if the areas in the input were ['A', 'B', 'C', 'D', 'E', 'D', 'F', 'D', 'G'], should the order of the people in the output be [1, 1, 1, 2, 2, 2, 3, 3, 3] or [1, 1, 1, 2, 3, 2, 3, 2, 3]?
      – tel
      Nov 12 at 0:06






      @PeterJames123 As a point of clarification, if the areas in the input were ['A', 'B', 'C', 'D', 'E', 'D', 'F', 'D', 'G'], should the order of the people in the output be [1, 1, 1, 2, 2, 2, 3, 3, 3] or [1, 1, 1, 2, 3, 2, 3, 2, 3]?
      – tel
      Nov 12 at 0:06














      thank you for attempting this. I understand it's a difficult one. If they were within an hour of each other it would be [1,1,1,2,3,2,3,2,3]. The first 3 items in [Area] would be grouped first, all the D's would be grouped second, and the leftovers would be grouped third.
      – PeterJames123
      Nov 12 at 1:47




      thank you for attempting this. I understand it's a difficult one. If they were within an hour of each other it would be [1,1,1,2,3,2,3,2,3]. The first 3 items in [Area] would be grouped first, all the D's would be grouped second, and the leftovers would be grouped third.
      – PeterJames123
      Nov 12 at 1:47












      in regards to your code it works on a few iterations but it seems to be assigning too few individuals for the amount of unique values occurring. I'll attach another example at the bottom of the question. Thank you for trying this again.
      – PeterJames123
      Nov 12 at 1:48




      in regards to your code it works on a few iterations but it seems to be assigning too few individuals for the amount of unique values occurring. I'll attach another example at the bottom of the question. Thank you for trying this again.
      – PeterJames123
      Nov 12 at 1:48












      @PeterJames123 I posted a second version with an algorithm that more closely matches your idea of looking ahead an hour to see if someone should stay where they are (see holdSieve in the getAssignedPeople function). Let me know if this one is good. If not, post more more example input/output data and we'll iterate again.
      – tel
      Nov 12 at 6:15




      @PeterJames123 I posted a second version with an algorithm that more closely matches your idea of looking ahead an hour to see if someone should stay where they are (see holdSieve in the getAssignedPeople function). Let me know if this one is good. If not, post more more example input/output data and we'll iterate again.
      – tel
      Nov 12 at 6:15












      It's looking really close @tel! I'm liking the look of this. Just posted a new iteration. The total amount of individuals assigned is correct. Just need to refine the allocation of Area
      – PeterJames123
      Nov 12 at 7:17




      It's looking really close @tel! I'm liking the look of this. Just posted a new iteration. The total amount of individuals assigned is correct. Just need to refine the allocation of Area
      – PeterJames123
      Nov 12 at 7:17












      up vote
      1
      down vote













      Ok, before we delve into the logic of the problem it is worthwhile to do some housekeeping to tidy-up the data and bring it into a more useful format:



      #Create table of unique people
      unique_people = df[['Person']].drop_duplicates().sort_values(['Person']).reset_index(drop=True)

      #Reformat time column
      df['Time'] = pd.to_datetime(df['Time'])


      Now, getting to the logic of the problem, it is useful to break the problem down in to stages. Firstly, we will want to create individual jobs (with job numbers) based on the 'Area' and the time between them. i.e. jobs in the same area, within an hour can share the same job number.



      #Assign jobs
      df= df.sort_values(['Area','Time']).reset_index(drop=True)
      df['Job no'] = 0
      current_job = 1
      df.loc[0,'Job no'] = current_job
      for i in range(rows-1):
      prev_row = df.loc[i]
      row = df.loc[i+1]
      time_diff = (row['Time'] - prev_row['Time']).seconds //3600
      if (row['Area'] == prev_row['Area']) & (time_diff == 0):
      pass
      else:
      current_job +=1
      df.loc[i+1,'Job no'] = current_job


      With this step now out of the way, it is a simple matter of assigning 'Persons' to individual jobs:



      df= df.sort_values(['Job no']).reset_index(drop=True)
      df['Person'] = ""
      df_groups = df.groupby('Job no')
      for group in df_groups:
      group_size = group[1].count()['Time']
      for person_idx in range(len(unique_people)):
      person = unique_people.loc[person_idx]['Person']
      person_count = df[df['Person']==person]['Person'].count()
      if group_size <= (3-person_count):
      idx = group[1].index.values
      df.loc[idx,'Person'] = person
      break


      And finally,



      df= df.sort_values(['Time']).reset_index(drop=True)
      print(df)


      I've attempted to code this in a way that is easier to unpick, so there may well be efficiencies to be made here. The aim however was to set out the logic used.



      This code gives the expected results on both data sets, so I hope it answers your question.






      share|improve this answer

























        up vote
        1
        down vote













        Ok, before we delve into the logic of the problem it is worthwhile to do some housekeeping to tidy-up the data and bring it into a more useful format:



        #Create table of unique people
        unique_people = df[['Person']].drop_duplicates().sort_values(['Person']).reset_index(drop=True)

        #Reformat time column
        df['Time'] = pd.to_datetime(df['Time'])


        Now, getting to the logic of the problem, it is useful to break the problem down in to stages. Firstly, we will want to create individual jobs (with job numbers) based on the 'Area' and the time between them. i.e. jobs in the same area, within an hour can share the same job number.



        #Assign jobs
        df= df.sort_values(['Area','Time']).reset_index(drop=True)
        df['Job no'] = 0
        current_job = 1
        df.loc[0,'Job no'] = current_job
        for i in range(rows-1):
        prev_row = df.loc[i]
        row = df.loc[i+1]
        time_diff = (row['Time'] - prev_row['Time']).seconds //3600
        if (row['Area'] == prev_row['Area']) & (time_diff == 0):
        pass
        else:
        current_job +=1
        df.loc[i+1,'Job no'] = current_job


        With this step now out of the way, it is a simple matter of assigning 'Persons' to individual jobs:



        df= df.sort_values(['Job no']).reset_index(drop=True)
        df['Person'] = ""
        df_groups = df.groupby('Job no')
        for group in df_groups:
        group_size = group[1].count()['Time']
        for person_idx in range(len(unique_people)):
        person = unique_people.loc[person_idx]['Person']
        person_count = df[df['Person']==person]['Person'].count()
        if group_size <= (3-person_count):
        idx = group[1].index.values
        df.loc[idx,'Person'] = person
        break


        And finally,



        df= df.sort_values(['Time']).reset_index(drop=True)
        print(df)


        I've attempted to code this in a way that is easier to unpick, so there may well be efficiencies to be made here. The aim however was to set out the logic used.



        This code gives the expected results on both data sets, so I hope it answers your question.






        share|improve this answer























          up vote
          1
          down vote










          up vote
          1
          down vote









          Ok, before we delve into the logic of the problem it is worthwhile to do some housekeeping to tidy-up the data and bring it into a more useful format:



          #Create table of unique people
          unique_people = df[['Person']].drop_duplicates().sort_values(['Person']).reset_index(drop=True)

          #Reformat time column
          df['Time'] = pd.to_datetime(df['Time'])


          Now, getting to the logic of the problem, it is useful to break the problem down in to stages. Firstly, we will want to create individual jobs (with job numbers) based on the 'Area' and the time between them. i.e. jobs in the same area, within an hour can share the same job number.



          #Assign jobs
          df= df.sort_values(['Area','Time']).reset_index(drop=True)
          df['Job no'] = 0
          current_job = 1
          df.loc[0,'Job no'] = current_job
          for i in range(rows-1):
          prev_row = df.loc[i]
          row = df.loc[i+1]
          time_diff = (row['Time'] - prev_row['Time']).seconds //3600
          if (row['Area'] == prev_row['Area']) & (time_diff == 0):
          pass
          else:
          current_job +=1
          df.loc[i+1,'Job no'] = current_job


          With this step now out of the way, it is a simple matter of assigning 'Persons' to individual jobs:



          df= df.sort_values(['Job no']).reset_index(drop=True)
          df['Person'] = ""
          df_groups = df.groupby('Job no')
          for group in df_groups:
          group_size = group[1].count()['Time']
          for person_idx in range(len(unique_people)):
          person = unique_people.loc[person_idx]['Person']
          person_count = df[df['Person']==person]['Person'].count()
          if group_size <= (3-person_count):
          idx = group[1].index.values
          df.loc[idx,'Person'] = person
          break


          And finally,



          df= df.sort_values(['Time']).reset_index(drop=True)
          print(df)


          I've attempted to code this in a way that is easier to unpick, so there may well be efficiencies to be made here. The aim however was to set out the logic used.



          This code gives the expected results on both data sets, so I hope it answers your question.






          share|improve this answer












          Ok, before we delve into the logic of the problem it is worthwhile to do some housekeeping to tidy-up the data and bring it into a more useful format:



          #Create table of unique people
          unique_people = df[['Person']].drop_duplicates().sort_values(['Person']).reset_index(drop=True)

          #Reformat time column
          df['Time'] = pd.to_datetime(df['Time'])


          Now, getting to the logic of the problem, it is useful to break the problem down in to stages. Firstly, we will want to create individual jobs (with job numbers) based on the 'Area' and the time between them. i.e. jobs in the same area, within an hour can share the same job number.



          #Assign jobs
          df= df.sort_values(['Area','Time']).reset_index(drop=True)
          df['Job no'] = 0
          current_job = 1
          df.loc[0,'Job no'] = current_job
          for i in range(rows-1):
          prev_row = df.loc[i]
          row = df.loc[i+1]
          time_diff = (row['Time'] - prev_row['Time']).seconds //3600
          if (row['Area'] == prev_row['Area']) & (time_diff == 0):
          pass
          else:
          current_job +=1
          df.loc[i+1,'Job no'] = current_job


          With this step now out of the way, it is a simple matter of assigning 'Persons' to individual jobs:



          df= df.sort_values(['Job no']).reset_index(drop=True)
          df['Person'] = ""
          df_groups = df.groupby('Job no')
          for group in df_groups:
          group_size = group[1].count()['Time']
          for person_idx in range(len(unique_people)):
          person = unique_people.loc[person_idx]['Person']
          person_count = df[df['Person']==person]['Person'].count()
          if group_size <= (3-person_count):
          idx = group[1].index.values
          df.loc[idx,'Person'] = person
          break


          And finally,



          df= df.sort_values(['Time']).reset_index(drop=True)
          print(df)


          I've attempted to code this in a way that is easier to unpick, so there may well be efficiencies to be made here. The aim however was to set out the logic used.



          This code gives the expected results on both data sets, so I hope it answers your question.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 12 at 14:04









          Colin Dickie

          51727




          51727






















              up vote
              0
              down vote













              In writing my other answer, I slowly came around to the idea that the OP's algorithm might be easier to implement with an approach that focuses on the jobs (which can be different), instead of the people (which are all the same). Here's a solution that uses the job-centric approach:



              from collections import Counter
              import numpy as np
              import pandas as pd

              def assignJob(job, assignedix, areasPerPerson):
              for i in range(len(assignedix)):
              if (areasPerPerson - len(assignedix[i])) >= len(job):
              assignedix[i].extend(job)
              return True
              else:
              return False

              def allocatePeople(df, areasPerPerson=3):
              areas = df['Area'].values
              times = pd.to_datetime(df['Time']).values
              peopleUniq = df['Person'].unique()
              npeople = int(np.ceil(areas.size / float(areasPerPerson)))

              # search for repeated areas. Mark them if the next repeat occurs within an hour
              ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
              holds = np.zeros(areas.size, dtype=bool)
              holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

              jobs =
              _jobdict = {}
              for i,(area,hold) in enumerate(zip(areas, holds)):
              if hold:
              _jobdict[area] = job = _jobdict.get(area, ) + [i]
              if len(job)==areasPerPerson:
              jobs.append(_jobdict.pop(area))
              elif area in _jobdict:
              jobs.append(_jobdict.pop(area) + [i])
              else:
              jobs.append([i])
              jobs.sort()

              assignedix = [ for i in range(npeople)]
              for job in jobs:
              if not assignJob(job, assignedix, areasPerPerson):
              # break the job up and try again
              for subjob in ([sj] for sj in job):
              assignJob(subjob, assignedix, areasPerPerson)

              df = df.copy()
              for i,aix in enumerate(assignedix):
              df.loc[aix, 'Person'] = peopleUniq[i]
              return df


              This version of allocatePeople has also been extensively tested and passes all of the same checks described in my other answer.



              It does have more looping than my other solution, so it is likely to be slightly less efficient (though it'll only matter if your dataframe is very large, say 1e6 rows and up). On the other hand, it is somewhat shorter and, I think, more straightforward and easy to understand.






              share|improve this answer

























                up vote
                0
                down vote













                In writing my other answer, I slowly came around to the idea that the OP's algorithm might be easier to implement with an approach that focuses on the jobs (which can be different), instead of the people (which are all the same). Here's a solution that uses the job-centric approach:



                from collections import Counter
                import numpy as np
                import pandas as pd

                def assignJob(job, assignedix, areasPerPerson):
                for i in range(len(assignedix)):
                if (areasPerPerson - len(assignedix[i])) >= len(job):
                assignedix[i].extend(job)
                return True
                else:
                return False

                def allocatePeople(df, areasPerPerson=3):
                areas = df['Area'].values
                times = pd.to_datetime(df['Time']).values
                peopleUniq = df['Person'].unique()
                npeople = int(np.ceil(areas.size / float(areasPerPerson)))

                # search for repeated areas. Mark them if the next repeat occurs within an hour
                ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
                holds = np.zeros(areas.size, dtype=bool)
                holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

                jobs =
                _jobdict = {}
                for i,(area,hold) in enumerate(zip(areas, holds)):
                if hold:
                _jobdict[area] = job = _jobdict.get(area, ) + [i]
                if len(job)==areasPerPerson:
                jobs.append(_jobdict.pop(area))
                elif area in _jobdict:
                jobs.append(_jobdict.pop(area) + [i])
                else:
                jobs.append([i])
                jobs.sort()

                assignedix = [ for i in range(npeople)]
                for job in jobs:
                if not assignJob(job, assignedix, areasPerPerson):
                # break the job up and try again
                for subjob in ([sj] for sj in job):
                assignJob(subjob, assignedix, areasPerPerson)

                df = df.copy()
                for i,aix in enumerate(assignedix):
                df.loc[aix, 'Person'] = peopleUniq[i]
                return df


                This version of allocatePeople has also been extensively tested and passes all of the same checks described in my other answer.



                It does have more looping than my other solution, so it is likely to be slightly less efficient (though it'll only matter if your dataframe is very large, say 1e6 rows and up). On the other hand, it is somewhat shorter and, I think, more straightforward and easy to understand.






                share|improve this answer























                  up vote
                  0
                  down vote










                  up vote
                  0
                  down vote









                  In writing my other answer, I slowly came around to the idea that the OP's algorithm might be easier to implement with an approach that focuses on the jobs (which can be different), instead of the people (which are all the same). Here's a solution that uses the job-centric approach:



                  from collections import Counter
                  import numpy as np
                  import pandas as pd

                  def assignJob(job, assignedix, areasPerPerson):
                  for i in range(len(assignedix)):
                  if (areasPerPerson - len(assignedix[i])) >= len(job):
                  assignedix[i].extend(job)
                  return True
                  else:
                  return False

                  def allocatePeople(df, areasPerPerson=3):
                  areas = df['Area'].values
                  times = pd.to_datetime(df['Time']).values
                  peopleUniq = df['Person'].unique()
                  npeople = int(np.ceil(areas.size / float(areasPerPerson)))

                  # search for repeated areas. Mark them if the next repeat occurs within an hour
                  ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
                  holds = np.zeros(areas.size, dtype=bool)
                  holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

                  jobs =
                  _jobdict = {}
                  for i,(area,hold) in enumerate(zip(areas, holds)):
                  if hold:
                  _jobdict[area] = job = _jobdict.get(area, ) + [i]
                  if len(job)==areasPerPerson:
                  jobs.append(_jobdict.pop(area))
                  elif area in _jobdict:
                  jobs.append(_jobdict.pop(area) + [i])
                  else:
                  jobs.append([i])
                  jobs.sort()

                  assignedix = [ for i in range(npeople)]
                  for job in jobs:
                  if not assignJob(job, assignedix, areasPerPerson):
                  # break the job up and try again
                  for subjob in ([sj] for sj in job):
                  assignJob(subjob, assignedix, areasPerPerson)

                  df = df.copy()
                  for i,aix in enumerate(assignedix):
                  df.loc[aix, 'Person'] = peopleUniq[i]
                  return df


                  This version of allocatePeople has also been extensively tested and passes all of the same checks described in my other answer.



                  It does have more looping than my other solution, so it is likely to be slightly less efficient (though it'll only matter if your dataframe is very large, say 1e6 rows and up). On the other hand, it is somewhat shorter and, I think, more straightforward and easy to understand.






                  share|improve this answer












                  In writing my other answer, I slowly came around to the idea that the OP's algorithm might be easier to implement with an approach that focuses on the jobs (which can be different), instead of the people (which are all the same). Here's a solution that uses the job-centric approach:



                  from collections import Counter
                  import numpy as np
                  import pandas as pd

                  def assignJob(job, assignedix, areasPerPerson):
                  for i in range(len(assignedix)):
                  if (areasPerPerson - len(assignedix[i])) >= len(job):
                  assignedix[i].extend(job)
                  return True
                  else:
                  return False

                  def allocatePeople(df, areasPerPerson=3):
                  areas = df['Area'].values
                  times = pd.to_datetime(df['Time']).values
                  peopleUniq = df['Person'].unique()
                  npeople = int(np.ceil(areas.size / float(areasPerPerson)))

                  # search for repeated areas. Mark them if the next repeat occurs within an hour
                  ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
                  holds = np.zeros(areas.size, dtype=bool)
                  holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

                  jobs =
                  _jobdict = {}
                  for i,(area,hold) in enumerate(zip(areas, holds)):
                  if hold:
                  _jobdict[area] = job = _jobdict.get(area, ) + [i]
                  if len(job)==areasPerPerson:
                  jobs.append(_jobdict.pop(area))
                  elif area in _jobdict:
                  jobs.append(_jobdict.pop(area) + [i])
                  else:
                  jobs.append([i])
                  jobs.sort()

                  assignedix = [ for i in range(npeople)]
                  for job in jobs:
                  if not assignJob(job, assignedix, areasPerPerson):
                  # break the job up and try again
                  for subjob in ([sj] for sj in job):
                  assignJob(subjob, assignedix, areasPerPerson)

                  df = df.copy()
                  for i,aix in enumerate(assignedix):
                  df.loc[aix, 'Person'] = peopleUniq[i]
                  return df


                  This version of allocatePeople has also been extensively tested and passes all of the same checks described in my other answer.



                  It does have more looping than my other solution, so it is likely to be slightly less efficient (though it'll only matter if your dataframe is very large, say 1e6 rows and up). On the other hand, it is somewhat shorter and, I think, more straightforward and easy to understand.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 12 at 21:12









                  tel

                  3,5511427




                  3,5511427






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.





                      Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                      Please pay close attention to the following guidance:


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f52730864%2fre-assign-column-values-in-a-pandas-df%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Bressuire

                      Vorschmack

                      Quarantine