Friday, September 1, 2017

Minimizing a Median

\( \def\xorder#1{x_{\left(#1\right)}} \def\xset{\mathbb{X}} \def\xvec{\mathbf{x}} \)A somewhat odd (to me) question was asked on a forum recently. Assume that you have continuous variables $x_{1},\dots,x_{N}$ that are subject to some constraints. For simplicity, I'll just write $\xvec=(x_{1},\dots,x_{N})\in\xset$. I'm going to assume that $\xset$ is compact, and so in particular the $x_{i}$ are bounded. The questioner wanted to know how to model the problem of minimizing the median of the values $x_{1},\dots,x_{N}$. I don't know why that was the goal, but the formulation is mildly interesting and fairly straightforward, with one wrinkle.

The wrinkle has to do with whether $N$ is odd or even. Suppose that we sort the components of some solution $\xvec$, resulting in what is sometimes called the "order statistic": $\xorder 1\le\xorder 2\le\dots\xorder N$. For odd $N$, the median is $$\xorder{\frac{N+1}{2}}.$$For even $N$, it is usually defined as $$\left(\xorder{\frac{N}{2}}+\xorder{\frac{N}{2}+1}\right)/2.$$

The odd case is easier, so we'll start there. Introduce $N$ new binary variables $z_{1},\dots,z_{N}$ and a new continuous variable $y$, which represents the median $x$ value. The objective will be to minimize $y$. In addition to the constraint $\xvec\in\xset$, we use "big-M" constraints to force $y$ to be at least as large as half the sample (rounding "half" up). Those constraints are: \begin{align*} y & \ge x_{i}-M_{i}z_{i},\ i=1,\dots,N\\ \sum_{i=1}^{N}z_{i} & =\frac{N-1}{2} \end{align*} with the $M_{i}$ sufficiently large positive constants. The last constraint forces $z_{i}=0$ for exactly $\frac{N+1}{2}$ of the indices $i$, which in turn forces $y\ge x_{i}$ for $\frac{N+1}{2}$ of the $x_{i}$. Since the objective minimizes $y$, $z_{i}$ will be 0 for the $\frac{N+1}{2}$ smallest of the $x_{i}$ and $y$ will be no larger than the smallest of them. In other words, we are guaranteed that in the optimal solution $y=\xorder{\frac{N+1}{2}}$, i.e., it is the median of the optimal $\xvec$.

If $N$ is even and if we are going to use the standard definition of median, we need twice as many added variables (or at least that's the formulation that comes to mind). In addition to the $z_{i}$, let $w_{1},\dots,w_{N}$ also be binary variables, and replace $y$ with a pair of continuous variables $y_{1}$, $y_{2}$. The objective becomes minimization of their average $(y_{1}+y_{2})/2$ subject to the constraints \begin{align*} y_{1} & \ge x_{i}-M_{i}z_{i}\ \forall i\\ y_{2} & \ge x_{i}-M_{i}w_{i}\ \forall i\\ \sum_{i=1}^{N}z_{i} & =\frac{N}{2}-1\\ \sum_{i=1}^{N}w_{i} & =\frac{N}{2} \end{align*} where the $M_{i}$ are as before. The constraints force $y_{1}$ to be at least as large as $\frac{N}{2}+1$ of the $x_{i}$ and $y_{2}$ to be at least as large as $\frac{N}{2}$ of them. The minimum objective value will occur when $y_{1}=\xorder{\frac{N}{2}+1}$ and $y_{2}=\xorder{\frac{N}{2}}$.

Monday, August 21, 2017

Updated Stepwise Regression Function

Back in 2011, when I was still teaching, I cobbled together some R code to demonstrate stepwise regression using F-tests for variable significance. It was a bit unrefined, not intended for production work, and a few recent comments on that post raised some issues with it. So I've worked up a new and (slightly) improved version of it.

The new version is provided in an R notebook that contains both the stepwise function itself and some demonstration code using it. It does not require an R libraries besides the "base" and "stats" packages. There is at least one butt-ugly hack in it that would keep me from being hired in any sort of programming job, but so far it has passed all the tests I've thrown at it. If you run into issues with it, feel free to use the comment section below to let me know. I'm no longer teaching, though, so be warned that maintenance on this is not my highest priority.

The updated function has a few new features:
  • it returns the final model (as an lm object), which I didn't bother to do in the earlier version;
  • you can specify the initial and full models as either formulas (y~x+z) or strings ("y~x+z"), i.e., quotes are strictly optional; and
  • as with the lm function, it has an optional data = ... argument that allows you to specify a data frame.
There are also a few bug fixes:
  • if you set the alpha-to-enter greater than the alpha-to-leave, which could throw the function into an indefinite loop, the function will now crab at you and return NA
  • if you try to fit a model with more parameters than you have observations, the function will now crab at you and return NA; and
  • the function no longer gets confused (I think) if you happen to pick variable/column names that happen to clash with variable names used inside the function.
As always, the code is provided with a Creative Commons license, as-is, no warranty express or implied, your mileage may vary.

Monday, August 7, 2017

Rolling Horizons

I keep seeing questions posted by people looking for help as they struggle to optimize linear programs (or, worse, integer linear programs) with tens of millions of variables. In my conscious mind, I know that commercial optimizers such as CPLEX allow models that large (at least if you have enough memory) and can often solve them (at least if you have enough computing resources to throw at the problems). My lizard brain, however, was conditioned by the state of the art in the late '70s to send me running (typically while screaming) at the sight of a model with more than, oh, 50 or so variables. Wrapping my head around tens of millions of variables, let alone thinking of a strategy for getting an optimal solution "quickly", is quite a struggle.

A former acquaintance from my student days once articulated his strategy for essay exams to me: if you don't know the answer, argue the premise of the question. In that vein, I'm inclined to ask why so many variables are necessary. One possibility is that the model captures decisions for a sequence of time periods. If decisions in one time period had no impact on subsequent periods, the problem would decompose naturally into a bunch of smaller problems; so if we are talking about a multiperiod model, it's safe to assume that the periods are connected.

That brings me to a strategy I learned back in primitive times, the "rolling horizon" approach. Let me stipulate up front that this is a heuristic method. It does not provide a provably optimal solution for the entire time horizon. Still, given the up-tick in humongous models, I'm starting to wonder if rolling horizons are no longer taught (or are looked down on).

The basic concept is simple. The devil is in the details. Let's say we want to solve a planning model over a horizon of $H$ time periods, and that one omnibus model for the entire horizon is proving intractable. The rolling horizon approach is as follows.
  1. Pick a shorter horizon $K < H$ that is tractable, and a number of periods $F \le K$ to freeze.
  2. Set "boundary conditions" (more on this below).
  3. Solve a model for periods $1,\dots,K$, incorporating the boundary conditions.
  4. Freeze the decisions for periods $1,\dots,F$.
  5. Solve a model for periods $F+1,\dots, \min(F+K, H)$.
  6. Repeat ad nauseam.
Note that if $F<K$, some decisions made in each solution but not frozen will be subject to revision in the next model.

The initial conditions (starting inventory, locations of vehicles, pending orders, ...) for each model are dictated by the state of the system after the last frozen period. The boundary conditions are limits on how much of a mess you can leave at the end of the reduced planning horizon (period $K$ in the first model, $K+F$ in the second model, etc.). More precisely, they limit the terminal state of things that will be initial conditions in the next model.

As a concrete example, consider a manufacturing scheduling model. You start with inventories of components and finished products, available capacity of various kinds, and unfilled orders, and you end with the same kinds of things. Without boundary conditions, your solution for the first model might end period $K$ with no finished goods inventory. Why make stuff if it doesn't count toward your demand within the time frame considered by the model? That might make the problem for periods $K+1$ onward infeasible, though: you have orders that must be filled, they exceed your immediate production capacity, and the cupboard was left bare by the previous solution.

So you want to add constraints of the form "don't leave me with less than this much inventory on hand" or "don't leave me with more than this many unfilled orders". Picking values for those limits is a bit of an art form. Make the limits too draconian and you will get a very suboptimal solution (say, piling up way more inventory than you'll really need) or possibly even make the early subproblems infeasible. Make the limits too laissez faire, and you may force thoroughly suboptimal solutions to later subproblems (piling on lots of expensive overtime, blowing off soon-to-be-former customers) or even make the later problems infeasible. Still, it's usually possible to come up with pretty reasonable boundary conditions, perhaps with some trial and error, and it's a way to get a solution that, if not globally optimal, is at least good (and preferable to staring bleary-eyed at your screen in the 30th hour of a run, wondering if the beast will ever converge).

The term "rolling horizon" was coined in reference to problems that are decomposed based on time, but the same concept may extent to problems that can be decomposed based on position. I used it not long ago to get a heuristic for a problem placing nodes in a physical network. In essence, we took the original geographic region and chopped it into pieces, picked a piece to start from, and then worked our way outward from that piece until all the pieces had been solved, each based on the (frozen) solution to the more inward pieces.

Friday, July 14, 2017

Memory Minimization

As I grow older, I'm starting to forget things (such as all the math I ever learned) ... but that's not the reason for the title of this post.

A somewhat interesting question popped up on Mathematics StackExchange. It combines a basic sequencing problem (ordering the processing of computational tasks) with a single resource constraint (memory) and a min-max criterion (minimize the largest amount of memory consumed anywhere in the sequence). What differentiates the problem from other resource-constrained scheduling problems I've seen is that the output of each task eats a specified amount of memory and must stay there until every successor task that needs it as input completes, after which the memory can be recycled. In particular, the output of the last task using the result of task $n$ will coexist momentarily with the output of $n$, before $n$'s output can be sent to the bit-recycler. Also, there is no time dimension here (we neither know nor care how much CPU time tasks need), just sequencing.

The data for the problem is a precedence graph (directed acyclic graph, one node for each task) plus an integer weight $w_n$ for each task $n$, representing the memory consumption of the value computed by task $n$. There are several ways to attack a problem like this, including (mixed) integer programming (MIP), constraint programming (CP), and all sorts of metaheuristics. MIP and CP guarantee optimal solutions given enough resources (time, memory, computing power). Metaheuristics do not guarantee optimality, but also do not require commercial software to implement and may scale better than MIP and possibly CP.

I thought I'd publish my MIP model for the problem, scaling be damned.

Index sets and functions

Let $N$ be the number of tasks (nodes).
  • I will use $V=\left\{ 1,\dots,N\right\}$ as the set of tasks and $S=\left\{ 1,\dots,N\right\}$ as the set of schedule slots. (Yes, I know they are the same set, but the dual notation may help distinguish things in context.)
  • $A\subset V\times V$ will represent the set of arcs, where $(u,v)\in A$ means that task $v$ takes the output of task $u$ as an input.
  • $\pi:V\rightarrow2^{V}$ and $\sigma:V\rightarrow2^{V}$ will map each task to its immediate predecessors (the inputs to that task) and immediate successors (tasks that need its value as inputs) respectively.


The only parameter is $w:V\rightarrow\mathbb{N}$, which maps tasks to their memory footprints.


There are a few ways to express assignments in MIP models, the two most common of which are "does this critter go in this slot" and "does this critter immediately follow this other critter". I'll use the former.
  • $x_{ij}\in\left\{ 0,1\right\} \ \forall i\in V,\forall j\in S$ will designate assignments, taking the value 1 if task $i$ is scheduled into slot $j$ in the sequence and 0 otherwise.
  • $r_{ij}\in\left[0,1\right]\ \forall i\in V,\forall j\in S$ will take the value 1 if the output of task $i$ is resident in memory immediately prior to performing the computational task in slot $j$ and 0 otherwise.
  • $d_{ij}\in\left[0,1\right]\ \forall i\in V,\forall j\in S$ indicates whether all tasks requiring the output of task $i$ have been completed prior to the execution of the task in slot $j$ (1) or not (0).
  • $z>0$ is the maximum memory usage anywhere in the schedule.
We can actually winnow a few of these variables. If task $i$ has any predecessors, then clearly $x_{i1}=0$, since there is no opportunity before the first slot to execute a predecessor task. Similarly, $d_{i1}=0\,\forall i\in V$, since nothing has been completed prior to the first slot.

You may think at this point that you have spotted a typo in the second and third bullet points (brackets rather than braces). You haven't. Read on.


The easiest part of the model is the objective function: minimize $z$.


The assignment constraints are standard. Every task is assigned to exactly one slot, and every slot is filled with exactly one task. $$\sum_{j\in S}x_{ij}=1\quad\forall i\in V$$ $$\sum_{i\in V}x_{ij}=1\quad\forall j\in S$$ Precedence constraints are also fairly straightforward. No task can be scheduled before each of its predecessors has completed.$$x_{ij}\le\sum_{k <j}x_{hk}\quad\forall i\in V,\forall j\in S,\forall h\in\pi(i)$$ Next, we define memory consumption. Immediately after executing the task in any slot $j$, the memory footprint will be the combined consumption of the output of that task and the output of any earlier task still in memory when slot $j$ executes.$$z\ge\sum_{i\in V}w(i)\left(r_{ij}+x_{ij}\right)\quad\forall j\in S$$ Defining whether the output of some task $i$ is no longer needed prior to executing the task in some slot $j>1$ is fairly straightforward.$$d_{ij}\le\sum_{k<j}x_{hk}\quad\forall i\in V,\forall 1<j\in S,\forall h\in\sigma(i)$$As noted above, there is no "before slot 1", so we start this set of constraints with $j=2$.

Finally, the result of task $i$ remains in memory at the point where the task in slot $j$ executes if task $i$ executed before slot $j$ and the memory has not yet been freed.$$ r_{ij}\ge\sum_{k<j}x_{ik}-d_{ij}\quad\forall i\in V,\forall 1<j\in S$$Again, this is nonsensical for $j=1$, so we start with $j=2$.


I'll end with a few observations about the model.
  • It has $O(N^2)$ binary variables, so, as I noted above, it will not scale all that gracefully.
  • As mentioned early, the relaxation of $d_{ij}$ and $r_{ij}$ from binary to continuous in the interval $[0,1]$ was deliberate, not accidental. Why is this justified? $d_{ij}$ is bounded above by an integer quantity, and the solver "wants" it to be $1$: the sooner something can be removed from memory, the better for the solution. Similarly, $r_{ij}$ is bounded below by an integer expression, and the solver "wants" it to be $0$: the less cruft hanging around in memory, the better. So the solution will be binary even if the variables are not.
  • The solution might be "lazy" about clearing memory. From the model perspective, there is no rush to remove unneeded results as long as they are not contributing to the maximum memory footprint. The maximum memory load $z$ may occur at multiple points in the schedule. At least one would contain nothing that could be removed (else the solution would not be optimal). At points $j$ where memory use is below maximum, and possibly at points where memory use equals the maximum, there may be tasks $i$ for which $d_{ij}=0$ in the solution but could be set to $1$, without however reducing $z$. This can be avoided by adding constraints that force $d_{ij}=1$ wherever possible, but those constraints would enlarge the model (quite possibly slowing the solver) without improving the final solution.

Friday, June 23, 2017

Premature Obituaries

[T]he report of my death was an exaggeration. (Mark Twain, 1897)
In a recent blog post, "Data Science Is Not Dead", Jean-Francois Puget discussed and dissented with a post by Jeroen ter Heerdt titled "Data Science is dead." Barring the possibility that Schroedinger shoved data science into a box and sealed it, both assertions cannot simultaneously be true. The central thesis of ter Heerdt's post is that data scientists have developed powerful and easy to use tools, now deployed on cloud services, that let lay people do the analyses that previously required said data scientists, in effect putting themselves out of business. Puget responds that "there is a very long tail of use cases where developing a packaged service isn't worth it", and draws parallels to operations research. Common and important problems such as routing, machine scheduling, crew scheduling and so on have led to the development of problem-specific commercial software, but "many companies still have an OR department (under that name, or as part of their data science department) because they have operations problems that cannot be optimized with off the shelf software of services".

I'm siding with Puget on this, having experienced the "demise" of management science (if not all of OR) some decades ago. When I started on the faculty at Michigan State University, management science (whether under that name, "operations research" or something else) was a common and important element of business school programs. We had mandatory core courses in M.S. at both the undergraduate and masters levels, as well as higher level "elective" courses that were de facto requirements for some doctoral concentrations. We also participated in an interdisciplinary OR program at the masters level.

Gradually (but not gradually enough for me), MS evaporated at MSU (a bit ironic given the respective acronyms). Some of the more applied subject matter was moved into "functional area" courses (production planning, marketing, finance); most of the more conceptual subject matter just went away. As previously noted, canned software began to be available to solve many of the problems. The perceived need shifted from someone who understood algorithms to someone who could model the problem well enough to generate the inputs for the software.

As Puget notes, there is still demand for OR/MS professionals because there are new problems to be recognized and modeled, and models to be solved that do not fit neatly into the canned software. I believe there is also another reason not to start shoveling dirt on the grave of OR/MS. Users who learned a few basic incantations in a functional area class, without learning how the magic works (or does not work), may misapply techniques or produce incorrect models. Those who learn OR/MS (or analytics) as incantations may also tend to be a bit too literal-minded.

A possible analogy is the difference between a chef and someone like me who equates "cook" with "microwave on high". A chef understands a recipe as a framework, to be adjusted as needed. You couldn't get the cut of meat called for? Switch to this meat, then adjust this spice and cook a bit longer. On the (exceedingly rare) occasions I actually prepare a dish, I follow the instructions slavishly and try not to freelance at all.

Along those lines, I attended a thesis proposal defense (for a student not trained in OR/MS) where the work involved delivery routing and included a variant of a traveling salesman model. Both the candidate and his committee took it as axiomatic that a vehicle could not pass through the same node in the routing graph twice because, well, that's part of the definition of the TSP. So I posed the following simple question. You have a warehouse W and two customers A and B, all on the same street, with W between A and B. A is in a cul de sac, so the network diagram looks like

A -- W -- B -- whatever

with any number of links to B but only the A-W edge incident A (and only A-W and B-W incident on W). Trivial exercise: prove that, under the strict definition of a TSP, A and B cannot be on the same route, no matter how close they are to W (and each other).

My point here was not to bust the student's chops. An OR "chef" building a network model for truck deliveries would (hopefully) recognize that the arcs should represent the best (shortest, fastest, cheapest ...) way to travel between any pair of nodes, and not just physical roads. So, in the above example, there should be arcs between A and B that represent going "by way of" W. It's fine to say that you will stop at any destination exactly once, but I know of only two reasons why one would route a vehicle with a requirement that it pass through any location at most once: it's either laying land mines or dribbling radioactive waste behind it. Hopefully neither applied in the student's case.

The student, on the other hand, could not see past "arc = physical roadway" ... nor was he the only one. After the defense, the student contacted two companies that produce routing software for the trucking industry. According to him (and I'll have to take his word for this), neither of them had given any thought to the possibility that passing through the same node twice might be optimal, or to creating "meta-arcs" that represent best available routes rather than just direct physical links. If true, it serves as a reminder that canned software is not always correct software. Caveat emptor.

The flip side of being too "cookie cutter" in modeling is being too unstructured. As an instructor, I cringed at students (and authors) who thought a linear regression model could be applied anywhere, or that it was appropriate to pick any variable, even a categorical variable, as the dependent variable for a linear regression. To me, at least, neural networks and software services sold as "machine learning" are more opaque than regression models. Having someone with modest if any data science training cramming whatever data is at hand into one end of a statistical black box and accepting whatever comes out the other end does not bode well.

So, in addition to Puget's "tail problems", I think there will always be value in having some people trained at the "chef" level, whether it be in OR/MS or data science, working alongside the "line cooks".

Update: I just came across an article published by CIO, "The hidden risk of blind trust in AI’s ‘black box’", that I think supports the contention that companies will need data scientists despite the availability of machine learning tools. The gist is that not understanding how the AI tool arrives at its conclusions or predictions can creative some hazard, particularly in regulated industries. For instance, if there is some question about whether a company discriminates illegally in providing mortgages or loans, "the machine said" may not be an adequate defense.

Monday, May 29, 2017

If This and That Then Whatever

I was asked a question that reduced to the following: if $x$, $y$ and $z$ are all binary variables, how do we handle (in an integer programming model) the requirement "if $x=1$ and $y=1$ then $z=1$"? In the absence of any constraints on $z$ when the antecedent is not true, this is very easy: add the constraint $$z \ge x + y - 1.$$Verification (by substituting all possible combinations of 0 and 1 for the variables) is left to the reader as an exercise.

I thought I had covered this in a previous post, but looking back it appears that it never came up (at least in this form). This might be my shortest post ever. :-)

Sunday, May 28, 2017

Update Error: Wrong Architecture

Yesterday I ran into one of those mystifying glitches that, will infrequent, serve as a reminder that Linux is not for the faint of heart. When I booted my desktop system (Linux Mint 18.1 Serena), the system tray icon for the software updater was displaying a red "X" that indicates it tried and failed to update its cache. This is not uncommon, and the solution is usually simple: open the application and click the update button.

This time, however, the update produced a slew of error messages. For what appeared to be each repository it checked, it printed an error message saying
Skipping acquire of configured file <something with "i686" in it> as repository <whatever> doesn't support architecture 'i686'
After all the architecture error messages, I also got one about a problem with a GPG encryption key on a CRAN mirror. That message had been popping up, harmlessly, on each update for quite a while, so I did not worry about it.

The PC is a 64-bit machine, with an AMD processor, and I'm used to architecture references that contain either "amd64" or "x86_64" rather than "i686". So my suspicious was that something had messed up a setting somewhere on the machine that identified the architecture to the updater. Running uname -m in a terminal got me "x86_64", confirming that the machine itself was not undergoing any sort of identity crisis.

After a fruitless search for a malformed configuration file, I went to a Mint help forum, where I received a few answers. One said it was a server error, and I should just wait for them to fix it. That didn't hold water for a couple of reasons. First, it would require simultaneous problems on a whole bunch of unrelated repository servers, which would be a bit too coincidental. Second, my laptop (also using an AMD 64-bit chip, but running Mint 18.0 Sarah and a slightly older version of the updater) had no problem updating. Another suggestion, reinstalling the updater, also bore no fruit.

Along the way, I discovered that in fact two repositories were being updated correctly. Those two, not coincidentally, specified "[arch=amd64]" in their source listings. Adding that string to all the other PPAs and extra repositories worked, albeit at some cost (the updater started to label the repositories as "arch=amd64" rather than by their correct names). Unfortunately, the updater would not let me use that trick on the two main repositories (one for Mint, one for Ubuntu), which left me still looking for a fix.

Someone on the help forum suggested disabling all the PPAs on my list of sources and trying again. I did that, and (crucially) also disabled sources in the "Additional repositories" list, leaving just the Mint and Ubuntu repositories. Lo and behold, the update went through. So I started restoring the other repositories in a binary search, and found the culprit. Remember the CRAN mirror that had a problem with its encryption key? If I disabled just that source, which was in the "Additional repositories" list, things worked. So I switched mirrors, reloaded the public key, and things are back to normal.

I have no idea why the encryption key error, which I'd been getting for weeks if not months, suddenly caused things to go splat across the board -- and even less intuition as to why a bad encryption key would cause the updater to start using the wrong architecture code. For now, though, I'll settle for having updates working again.

Update 06/28/17:  It happened again, this time with the Revolution Analytics CRAN mirror yielding the message "server certificate verification failed" and something about a missing file (which I assume was their certificate). So I switched to yet another mirror, and we're back in business ... for the moment. I don't like the idea of downloading software over insecure/unencrypted connections (for reasons that I would hope are obvious), but this is getting tedious. Note to server admins: please keep on top of your certificate deployment.

Update 07/25/17: Here we go again, only this time with a new wrinkle. I added a PPA, did an update to the package list, and got the 'i686' architecture message for every repository. This time, though, there were no certificate errors, and disabling every source I could (including the newly added PPA) did not help. So something else was at fault. After searching the web for help, I tried the following commands in a terminal:

dpkg --print-architecture
dpkg --print-foreign-architectures

The first one listed 'amd64' (which is correct); the second listed 'i386' and 'i686'. I have one or more packages installed that actually do require the 'i386' architecture -- packages that only exist in 32 bit form (don't ask me which, I've forgotten) -- but none that I know of requiring 'i686'. So, again in a terminal, I ran

sudo dpkg --remove-architecture i686

and (after restoring all my repos) was able to update successfully. I don't know if adding the new PPA caused the 'i686' architecture to return, zombie-like, for another go around, or whether the package manager just wants to reset to that whenever I'm not looking. In any event, I think I have a fix now, but we'll see if it's permanent or temporary. 

Thursday, April 20, 2017

Sharing "Implicit" Test Problems

The topic of reproducible research is garnering a lot of attention these days. I'm not sure there is a 100% agreed upon, specific, detailed definition of it, and I do think it's likely to be somewhat dependent on the type of research, but for purposes of this post the Wikipedia definition (previous link) works for me. In particular, I want to call out one part of the Wikipedia entry:
The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the laboratory notebooks and full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research.
I've seen a lot of press related to reproducibility in the biological sciences, although I assume it's an issue in other sciences as well. (Good luck with that, physicists!) It's also increasingly an issue in the social sciences, where one study asserted that over half of the published results they tested failed to replicate. (That study itself was criticized as allegedly failing to replicate.) All of this has been termed by some a "replication crisis". In short, reproducibility is a big deal. There's at least one blog devoted to it, and a considerable amount of work in the statistics community is going into tools to facilitate reproducibility (such as R and Python "notebooks"). R users in particular should have a look at the rOpenSci Reproducibility Guide.

My research tends almost exclusively to fall into the category of "stupid math programming tricks". Either I'm trying to find some clever formulation of a (deterministic) problem, I'm trying to find an efficient way to generate an exact solution, I'm trying to find a decent heuristic for getting "good" solutions "quickly" (preferably without stretching analogies to nature too far: there's no way I'm beating slime mold optimization in the naming contest), or some combination of the above. Since I mostly avoid statistics (other than the occasional comparison of run times with some selected benchmark alternative), I've been largely unaffected by the debate on reproducibility ... until now.

Recently, the journals published by INFORMS have moved in the direction of reproducible research, and I suspect others are doing so (or will do so in the near future) as well. As an example relevant to my interests, the INFORMS Journal on Computing (JOC) has introduced policies on the sharing of both data and code. Personally, I think this is a good idea. I've shared data and code for one particular paper (machine scheduling) on my web site since the paper came out (20+ years ago), and I share data for a more recent paper as well (having released the code as an open source project).

At the same time, I recognize that there are various difficulties (licensing/copyright, for instance) in doing so, and also there are costs for the researcher. I'd be willing to share the code for one or two other projects, but it would take a major effort on my part to untangle the spaghetti and figure out which libraries are/are not required, and which code should be included or should be excluded as irrelevant to the ultimate experiments. I'm reluctant to commit that time until someone actually requests it.

There's another twist that I would not have anticipated until quite recently. I'm working on a project that involves "implicit hitting set" (IHS) problems. In an IHS problem, you have a master problem that formulates as a pure integer (in fact, 0-1) program. Candidate solutions to that model are fed to one or more "oracles", which either bless the solutions as feasible or generate additional constraints for the master problem that the candidates violate. Note that I have not said anything about the nature of the oracles. Oracles might be solving linear or integer programming models, which would be relatively easy to share as "data", but they might also be doing something decidedly funky that is encapsulated in their coding. In the latter case, the test "data" is actually code: I would have to provided the code for the oracles in order for someone to reproduce my results.

Well, okay, if sharing code is on the table now, isn't that just more code? Not quite. Let's say that some unfortunate doctoral student has been tasked by her advisor to code their shiny new IHS algorithm and test it against my published problems. The doctoral student used Python or Julia (or some other trendy language), whereas I used stodgy old Java, so there's a good chance the luckless doctoral student will have to recode my stuff (which, among other things, requires making sense of it). Moreover, I will have created an API to my oracles that may or may not fit with what they are doing ... and that's if I used an API at all. If I directly integrated program structures external to the oracle functions into the oracle (passed in variables, constraints or other elements of my master problems, for instance), our doctoral student will need to figure out how to isolate those elements and replace them with corresponding elements from her code ... if there is even a correspondence.

There's at least one implication in this for me. If I actually subscribe to the push for reproducibility (or if I take pity on other people's doctoral students), I need to code my oracles not as an integral part of my master code but as a separate software module or library with a well-documented and reasonably flexible interface to the outside world (API). <sigh>More work for me.</sigh>

Tuesday, March 28, 2017

Adventures with TeX Live

Since I use the Linux Mint operating system, the obvious (if not only) choice for a LaTeX distribution is TeX Live. (If you are not familiar with, or are not interested in, the LaTeX typesetting system, you have already read too far in this post.) On Mint, Ubuntu and other Debian-type operating systems, you typically install TeX Live by downloading and installing a number of fairly large files from a repository (in my case, mainly repositories for Ubuntu). Sometimes, though, you just want to install one or two small LaTeX packages without bothering with one of the large TeX Live files (which will contain a zillion other LaTeX packages you do not want or need). For this purpose, TeX Live includes a package manager (tlmgr).

Today, I found myself trying unsuccessfully to install something using tlmgr. The first problem was a missing compression/decompression program that was easy to install. After that, I continued to get cryptic error messages. Eventually, I decided to update tlmgr to see if that would help.

The Canonical (Ubuntu) repositories are fairly far behind the curve when it comes to TeX Live; they still provide the 2015 version, which is incompatible with newer versions (including the newer version of tlmgr). So updating tlmgr alone was not an option. (I had what appeared to be the most recent, and presumably final, incarnation of the 2015 version of tlmgr.) Following instructions I found on the Tips on Ubuntu site, I added a PPA containing the latest (I think) version of TeX Live, running the following in a shell:

sudo add-apt-repository ppa:jonathonf/texlive
sudo apt update 

After that, I used the software updater to update my TeX Live packages. (The "Tips on Ubuntu" instructions have you install TeX Live from the PPA, which presumes you do not already have a version installed, as I did.)

Before using tlmgr to do much of anything, you need to initialize the user tree by running

tlmgr init-usertree

in a shell. I had already done that, but I repeated it, confirming that the user tree was still intact. I then ran

tlmgr --gui

to open a graphical user interface, told it to load the default repository (there's a button in the "Repository" panel of the display), and tried -- unsuccessfully -- to install a LaTeX package.

So what was wrong now?  The error message contained the following snippet:

Tk::Error: /usr/bin/tlmgr: open(/usr/share/texlive/tlpkg/texlive.tlpdb)
failed: No such file or directory ...

I checked /usr/share/texlive/tlpkg, and sure enough there was no file texlive.tlpdb ... but there was a file named texlive.tlpdb.9ed73e8174b03a21aa0d40ebbcaac97f. It's a text file (around 12 MB) with configuration information and what looks like a directory structure. Why it had the rather funky hexadecimal extension, I have no idea. I symlinked it by running

sudo ln -s texlive.tlpdb.9ed73e8174b03a21aa0d40ebbcaac97f texlive.tlpdb

and tlmgr started behaving itself.

I later deleted the file with the scary looking hex extension, reran tlmgr and had it load the default repository again. It reproduced the file, but with a different (equally scary) hex extension. I again symlinked the file to texlive.tlpdb.

I have no idea why loading a repository produces what appears to be the correct file with the wrong extension and does not symlink it, but at least this fix seems to work.

Monday, February 27, 2017

MSU Students Do Microfinance

Several years ago, I found out about, an online "microfinance" site where individuals can make small loans ($25 is the standard increment at Kiva) to entrepreneurs in low income settings. The entrepreneurs actually apply for larger amounts, which they typically receive from third-party "field partners". The entrepreneurs repay the loans with interest to the field partners, who in turn repay the principal (no interest) to the original donors. A modest deposit by a donor can be turned over repeatedly into loans, enjoying a version of what in my long-ago undergrad econ class was termed the "multiplier effect". I've made a fair number of loans over the years, nearly all of which have been fully repaid. It's a bit of fun as well as personally satisfying.

So I was geeked to learn from one of my colleagues at the Broad College of Business, Professor Paulette Stenzel, that we have a student organization on campus dedicated to microfinance. The Spartan Global Development Fund is a 501(c)(3) nonprofit organization, created and run by students, that makes microloans both through their own field partners and through Kiva. It's a registered student organization, meaning that any Michigan State University student can join. Students and nonstudents alike can donate through PayPal.

Next time I'm on the Kiva site, I plan to join their Kiva "team", and I encourage anyone interested in microfinance to check out their site.

Thursday, February 23, 2017

Another Absolute Value Question

Someone asked whether it is possible to eliminate the absolute value from the constraint$$Lx\le |y| \le Ux$$(where $L\ge 0$ and $U>0$ are constants, $x$ is a binary variable, and $y$ is a continuous variable). The answer is yes, but what it takes depends on whether $L=0$ or $L>0$.

The easy case is when $L=0$. There are two possibilities for the domain of $y$: either $y\in [0,0]$ (if $x=0$) or $y\in [-U,U]$ (if $x=1$). One binary variable is enough to capture two choices, so we don't need any new variables. We just need to rewrite the constraint as$$-Ux\le y \le Ux.$$If $L>0$, then there are three possibilities for the domain of $y$: $[-U, -L] \cup [0,0] \cup [L, U]$. That means we'll need at least two binary variables to cover all choices. Rather than change the interpretation of $x$ (which may be used elsewhere in the questioner's model), I'll introduce two new binary variables ($z_1$ for the left interval and $z_2$ for the right interval) and link them to $x$ via $x=z_1+z_2$. That leads to the following constraints:$$-Uz_1 \le y \le -Lz_1 + U(1-z_1)$$and$$Lz_2 - U(1-z_2) \le y \le Uz_2.$$ If $x=0$, both $z_1$ and $z_2$ must be 0, and the new constraints force $y=0$. If $x=1$, then either $z_1=1$ or $z_2=1$ (but not both). Left to the reader as an exercise: verify that $z_1=1\implies -U\le y \le -L$ while $z_2=1 \implies L\le y\le U$.

Saturday, January 21, 2017

Fischetti on Benders Decomposition

I just came across slides for a presentation that Matteo Fischetti (University of Padova) gave at the Lunteren Conference on the Mathematics of Operations Research a few days ago. Matteo is both expert at and dare I say an advocate of Benders decomposition. I use Benders decomposition (or variants of it) rather extensively in my research, so it ends up being a frequent theme in my posts. Those posts tend to generate more comments than posts on other topics. Apparently Matteo and I are not the only BD users out there.

I don't know that I would recommend Matteo's presentation as the starting point for someone who has heard of BD but never used it, but I certainly recommend having a look at the slides for anyone who has any familiarity with BD. Matteo provides several interesting perspectives as well as a tip or two for potentially improving performance. I learned a few new things.

In a sad coincidence, Professor Jacques Benders, the originator of Benders decomposition, passed away at age 92 just eight days before Matteo's presentation.

Tuesday, January 10, 2017

Pro Bono Analytics Is Growing Social

Pro Bono Analytics is a program by INFORMS (the Institute for Operations Research and the Management Sciences, for the acronym-averse), "the largest society in the world for professionals in the field of operations research (O.R.), management science, and analytics". PBA "connects our members and other analytics professionals with nonprofit organizations working in underserved and developing communities". In other words, we hook up charitable organizations needing analytics or OR help with volunteers willing to provide it without compensation. Our volunteers are a mix of industry practitioners, academics, students and the occasional geezer retiree. I think the majority of the volunteer pool comes from the U. S., and a majority are INFORMS members, but we have volunteers from as far away as Australia, and a significant portion are non-members.

I'd encourage anyone with OR/MS/analytics skills to consider volunteering, and anyone who knows a charitable organization (particularly those of limited financial means) to let them know we're out there willing to extend a helping hand. Both potential clients and potential volunteers can find out more, and signify their interest, at the PBA home page (repeating the link above).

Our new (and apparently energetic) staff liaison, Tia Carrai, has expanded PBA's social media footprint. You can find us now on the following:
I'd also like to give a shout-out to Pro Bono O.R., a like-minded initiative by our sister institution in the U. K., the Operational Research Society.

Friday, January 6, 2017

Mapping Trackball Buttons

For years, I've used a Logitech M570 wireless trackball with my Linux Mint PC. I generally prefer trackballs to mice -- no need to lift and reposition after a bunch of movement -- and I find that using my thumb, rather than my index finger (or, if I'm in a bad mood, my middle finger) to move the cursor is less fatiguing for my hand and wrist. The M570 works fine with Linux (at least Mint and Ubuntu) with no need for additional drivers.

In addition to the usual (at least for non-Mac users) three main buttons (left, write and combination scroll wheel/third button), the M570 has a couple of secondary buttons, which Logitech describes as "large, easy-to-reach Back/Forward buttons". I'm not sure I'd agree about "large", but I agree that they are easy to reach, and up until my most recent operating system upgrade I would have agreed they acted as back and forward buttons. Prior to the upgrade (and with no special configuration on my part, at least that I can remember), the extra buttons acted like page up and page down in every web browser or document reader that I used. After the upgrade (and switch from the Cinnamon desktop to the Mate desktop), their behavior changed. In the Firefox web browser, they would switch among tabs rather than vertically scrolling the current tab. In Xreader and Acrobat Reader, they also did not move forward backward among pages. (I don't recall trying to read multiple documents at once to see if they would switch among open documents.)

I find the forward/backward action rather handy with multipage documents and long web pages, so I wanted the previous behavior back. It turned out (after some research) not to be hard to do.

The first step is to install the xbindkeys and xautomation packages, both available from the Canonical repositories. This can be done via Synaptic or by running
     sudo apt-get install xbindkeys
     sudo apt-get install xautomation
in a terminal. (I also installed xbindkeys-config, which xbindkeys "suggests", but I don't think it's really necessary.) The xautomation package provides a command xte that can fake a key press.

The second step is to determine what button numbers are assigned to the forward and backward buttons. Run
     xev | grep -i button
in a terminal. This will open a small window with a target. Position the cursor over that window and click each of the buttons. You will see two messages in the terminal for each button, one for the press and one for the release. Look for the button numbers. For me, they were button 8 for forward and button 9 for backward, but your mileage may vary.

The third step is to configure xbindkeys to translate the extra trackball buttons appropriately. The configuration for xbindkeys is kept in a plain text file named .xbindkeysrc in your home directory (~/.xbindkeysrc). You can either create a new one (if you don't have one yet, or if there is nothing in it you want to keep) containing the lines below, or append those lines to the existing file, using your choice of text editor.
# Key bindings for Logitech M570 trackball
"xte 'key Page_Up'"
"xte 'key Page_Down'"
If your button numbers differ from mine (8 and 9), edit the lines accordingly. Note carefully the single and double quotation marks; xte seems a bit finicky about them.

Finally, you'll want to test this. You need to restart xbindkeys to get it to read the modified configuration file. Theoretically you can do this by running
     killall -HUP xbindkeys
in a terminal, but I found it necessary to restart the X server (after closing any applications using it, such as my web browser and email client). Typically Control+Alt+Backspace will do the trick. After that, try the trackball keys and hopefully they'll behave as expected.