Finite class distribution

2017-08-03

My wife told me a colleague at school was having a hard time coming up with class distributions for the next year. Having worked on a finite domain solver for over a year that sounded like an interesting problem to tackle with FD. And it was. But it's too big. And I underestimated that slightly.

Class distributions are not easy. There's a few constraints to be taken into account but the sheer number of possibilities makes it a hard problem to solve under the constraints set.

For starters you want an equal distribution of the children over all available classes. While there's a max number of children per class, it doesn't make sense to have one class of 25 while the others have 32 children. For constraint solving, this kind of count based stuff is already a pretty tedious problem.

There are multiple balancing issues at play here. For example, you want to try and balance the gender as much as possible. There are two special classes and you need to balance the number of children that do and do not take those classes such that there's an even distribution between all classes. As a bonus, children that take one of those classes should all be put in one of two classes. This will make future scheduling problems easier.

You want to make try to make sure that children have a "buddy" to cycle home with. Especially for those kids that live further out from school you want to try and put at least two children that live in the same city/village in the same class. We don't want kids cycling home alone in an early winter eve.

Lastly there's friends and foes. To make life on the kids easier there the classes are built to try and put children in the same class as least some friends. At the same time there are some kids you definitely don't want together in a class so you have to make sure they don't end up in the same class. Unless there's absolutely no other way, of course.

In this particular problem there are 117 kids to distribute over 4 classes.

This is a big problem. I mean, big. Much bigger than I initially naively thought it to be.

To illustrate; there are potentially 21403281258520944011158111589102185172033959907819520000000 different ways of constructing the first class. I have no idea how to pronounce that but WolframAlpha tells me that it's about 21 octodecillion. That's not even including constructing the other three classes and mix and matches but also not including any constraints at all. Go read this pdf on permutations for details. However, inside this number it's difficult to get a fix on how many actual valid permutations exist in that space without brute forcing it. Which we're not going to do. I'm not sure whether it's madness to try but it's just beyond the scope of this puzzle.

Okay so the problem is big and we're not going to brute force our way out of it. Let's take a constraint driven approach. Enter a finite domain solver.

First we declare our problem. We have to define our inputs. In this case that is place of residence, gender, special course state, and friends and foes. All in an enumerated sense, of course.

Next we declare our actual constraints. For each kid we'll define a variable that tells us which class(es) it may be put into. For example, a kid that takes the special course could only be put in class 0 or 1 while all other kids may be put in any of the four classes. As FD searches it will try to narrow the allowed classes down for each kid-variable and check whether all constraints hold. Ultimately this should lead to each of these vars having a value 0 through 3 which denominates which class the kid ends up in. Or, of course, the inevitable failure to find this.

Code:
: K0 = [0 4] # Mike
: K1 = [0 4] # Jane
: K2 = [0 1] # Oliver
: K3 = [0 4] # Mary
...


Having those kid-class vars we can generate "is this kid in class k var. This is an unfortunate step creating many hard to eliminate constraints, but I don't see any way around that.

Code:
K0C0 = K0 ==? 0 # is Mike in Class 0?
K0C1 = K0 ==? 1 # is Mike in Class 1?
K0C2 = K0 ==? 2 # is Mike in Class 2?
K0C3 = K0 ==? 3 # is Mike in Class 3?
...
K3C3 = K0 ==? 3 # is Mary in Class 3?
...

In this we can determine at compile time that certain kids would never be part of certain classes, like Oliver in the example above would never be part of the last two classes. We'll try our best not to compile constraints for stuff we already know isn't relevant.

With these kid-class vars we can construct a constraint that limits how many kids are assigned to each class. We'll create a huge sum per class where we count all the kids that have been assigned to that class.

Code:
: C0count, C1count, C2count, C3count [28 32] # limit each class to 28 ~ 32 kids
: C0count = sum(K0C0 K1C0 K2C0 K3C0)
: C1count = sum(K0C1 K1C1 K2C1 K3C1)
: C2count = sum(K0C2 K1C2 K3C2)
: C3count = sum(K0C3 K1C3 K3C3)

As you can see, we can already exclude certain kids from classes they could never be part of. The sum constraint is notoriously difficult to eliminate as the semantic information (the constraint just wants to limit how many kids are assigned to a class) doesn't transfer to the actual constraint being applied (add all these vars together).

We'll basically take a similar approach to gender. Obviously the actual genders are fixed so to get the averages we can reuse the K0C0 kid-class vars from above and sum them per class for one gender only. As a small optimization step we'll want to use the gender that occurs the least since that generates the smallest sums.

Code:
C0boys = sum(K0C0 K2C0) # only add the kids which we already know are male (or female)
...


We take a similar approach to evenly distribute the kids that take one or two special courses. More sums. The approach is the same as above so I won't repeat it.

Next constraint was the place of living. In my data set I had about 16 different places. Little under a half of the kids live in the same place as where the school is at. There are a few places that have only one, two, or three kids. That sucks for them. For the sake of this constraint problem this is great because it drastically limits the number of valid class distributions.

Again we declare each place per kid. The places are enumerated and as a small optimization step the biggest places get 0 and 1. This helps because FD is heavily biased towards solving boolean vars.

Code:
: KP0 0 # Mike lives in Eindhoven
: KP1 0 # Jane lives in Eindhoven
: KP2 2 # Oliver lives in Reusel
: KP3 1 # Mary lives in Veldhoven

We need constraints that bind these kids in groups of at least 2 per class. Obviously kids that are the only one from a certain place can't be taken into account here. One could always manually tweak them to be together with kids from a place on the same route or nearby.

Ignoring the singles, we have three cases to consider; a place with exactly 2 kids, a place with exactly 3 kids, or a place with more than 3 kids. Kids from the first two types (2 or 3 kids in that place) will need to end up in the same class. Note that this may not be possible, but we have to try. I'll get back on these issues later.

Since we know the place for each kid at compile time we can simply constraint the kid-class var.

Code:
# two kids from one place; must be put in same class
K0 == K1 # Mike and Jane must end up in the same class

# three kids from one place; must be put in same class
K0 == K1 # Mike and Jane in same class
K1 == K2 # Jane and Oliver in same class
# (By rule of transitivity this will put K0 K1 and K2 in same class)

# four or more kids from one place
# again, for each class we have to count how many kids from
# some place are put in that class. these sums are smaller tho
# then we just make sure that this sum isn't 1 (because 0 is fine)

# count how many kids from Eindhoven are in the first
# class and make sure that count is never exactly 1
1 != sum(K0C0 K1C0 K15C0 K30C0)
1 != sum(K0C1 K1C1 K15C1 K30C1)
1 != sum(K0C2 K1C2 K15C2 K30C2)
1 != sum(K0C3 K1C3 K15C3 K30C3)
.... same for other places

This will make sure each class will have either two or more kids from the same place, or none at all. Except of course places that only have one kid in the data set.

Lastly we take a look at friends and foes.

Friends can be groups of people that we will want to put together. Groups in my input set range from 2 to 4 kids and the rule is, similar to places, that each kid should have at least one other kid from their group in the same class. So that means each class has 2, 3, or 4 kids from a group. Only groups of 4 could end up split in two classes. This constraint can lead to impossible situations as well.

The constraint approach is similar to places except since we know the upper bound of a group size to be 4, we can hardcode that case as well;

Code:
some(
all?((K0 ==? K1) (K1 ==? K2) (K2 ==? K3)) # all four kids in same class
some?(
all?((K0 ==? K1) (K2 ==? K3)) # A with B and C with D
all?((K0 ==? K2) (K1 ==? K3)) # or A with C and B with D
all?((K0 ==? K4) (K1 ==? K3)) # or A with D and C with B
)
)

Note that this constraint first states "at least one of these must hold" and then "do all of these hold?" together with "does at least one of these hold?". So while some() enforces the constraint, some?() merely reports whether the constraints hold ("a reifier").

This constraint looks complex but compiles pretty nicely.

Foes are fairly straightforward; you add constraints that say kid x can't be in the same as kid y. It doesn't really get more complex than that when declaring the constraint.

Code:
K0 != K2 # don't put Mike together with Oliver because they'll start trouble

This kind of approach will work for pretty much all the foe restrictions and will somewhat limit the options. There generally aren't too many of these though.

That completes our problem! You can find a complete input problem in this gist.

Might be interesting to look at the optimized version in this gist. This file is huge (1.5mb), though, because it's a debug output which collects information about the problem and allows me to analyze and optimize problems in general, improving the library. This is the FD Matrix for me ;) This is the gist of the problem. That is the problem we actually need to solve.

I'm still working on improving the presolve step of FD. That's taking an input problem and reducing it without branching possibilities, like you can see above. But some constraints are NP complete and will always require a certain amount of trial-and-error and hoping to be lucky.

Unfortunately even though the currently presolved problem looks surmountable, this still takes too long in FD. *sad trombone*