Learning Policies for the Dynamic Traveling Maintainer Problem with Alerts

Experiment design

To test the proposed solution approaches, we introduce 16 asset networks, each having a different combination of network size, cost structure and machine degradation matrices. To distinguish the various cases, we use a simple shorthand notation: a network with one machine and degradation matrix Q1 is labeled M1-Q1. When a network uses multiple distinct degradation matrices, we concatenate their labels in increasing order, e.g., a network with two machines, one with degradation matrix Q2 and one with Q3, is labeled M2-Q2Q3. For larger networks, we spread the degradation matrices evenly across the machines, so that several machines share the same matrix. Thus, experiment M4-Q2Q3 represents a network with four machines in which half are assigned the degradation matrix Q2 and the other half the degradation matrix Q3. For each network, we consider the cost structures C1, C2 and C3. The only exception is the last asset network with M=6, labeled M6-Q2Q3Q4-C, which increases the complexity by assigning different cost structures to different machines. A detailed description of each network follows:

M1-Q1 The simplest setup considers one machine (Q1) such that X_m = N_m, so the information gap vanishes. This edge case is introduced to show that the learned policies and heuristics can match the performance of, or reproduce, the policy found via Policy Iteration (PI). We can verify that the optimal policy acts greedily on alerts (under cost structures C1 and C3), or is reactive with respect to a transition to the failed state (under cost structure C2).
M1-Q4 This setup considers one machine (Q4) where X_m ⊂ N_m, i.e., there is partial information about the network state space. We quantify the information gap between information level L₃ and the lower information levels L ∈ {L₀, L₁, L₂} for a single machine.
M2-Q2Q3 We scale up the model complexity by increasing the network size. We consider two machines, one with degradation matrix Q2 and one with Q3, which implies that the decision-maker has partial information about the network state space. Note that each machine in this experiment degrades at a different rate; ideally, a good policy accounts for this by adapting its response time for each machine.
M4-Q2Q3 This network is obtained by doubling the M2-Q2Q3 network, i.e., the decision-maker now faces a four-machine network consisting of two Q2 machines and two Q3 machines. This is the maximum network size for which we can still optimize the MDP (under information level L₃) via PI on our hardware.
M6-Q2Q3Q4 The largest network, with six machines, is obtained by adding two Q4 machines to the network M4-Q2Q3. The result is a much more challenging problem in which each pair of machines has a different lifetime distribution. Moreover, the alerts of Q4 machines are likely to occur very early relative to the failure of the asset. This induces a challenging time-based maintenance problem in which the decision-maker has to adequately consider the risk of postponing maintenance.
M6-Q2Q3Q4-C Lastly, we increase the model complexity by considering multiple cost structures within one network. Network M6-Q2Q3Q4-C is identical to network M6-Q2Q3Q4, except that different machines are now assigned different cost structures: Q2 machines are assigned cost structure C2, Q3 machines cost structure C3, and Q4 machines cost structure C1. This setting is more realistic, as different assets may have different maintenance cost structures based on type, age, location and utilization.
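
The labeling scheme and the count of 16 networks described above can be made concrete with a short sketch. The function names `parse_label` and `enumerate_experiments`, the dictionary `MIXED_COSTS`, and the placement of machines sharing a matrix next to each other are illustrative assumptions and are not taken from the original implementation; only the labels, the even spreading of matrices, and the cost assignments come from the text.

```python
import re

# Cost structure per degradation matrix in M6-Q2Q3Q4-C, as stated in the text.
MIXED_COSTS = {"Q2": "C2", "Q3": "C3", "Q4": "C1"}

# The five base networks; each is combined with C1, C2 and C3.
BASE_LABELS = ["M1-Q1", "M1-Q4", "M2-Q2Q3", "M4-Q2Q3", "M6-Q2Q3Q4"]


def parse_label(label):
    """Expand a label such as 'M4-Q2Q3' into one degradation matrix per machine.

    Returns (assignment, mixed_costs), where mixed_costs is True for labels
    ending in '-C'.
    """
    match = re.fullmatch(r"M(\d+)-((?:Q\d+)+)(-C)?", label)
    if match is None:
        raise ValueError(f"unrecognized label: {label}")
    n_machines = int(match.group(1))
    matrices = re.findall(r"Q\d+", match.group(2))
    if n_machines % len(matrices) != 0:
        raise ValueError("matrices cannot be spread evenly over the machines")
    per_matrix = n_machines // len(matrices)
    # Assumption: machines sharing a matrix are listed consecutively.
    assignment = [q for q in matrices for _ in range(per_matrix)]
    return assignment, match.group(3) is not None


def enumerate_experiments():
    """List every (label, cost tag, per-machine (matrix, cost) pairs)."""
    experiments = []
    for label in BASE_LABELS:
        machines, _ = parse_label(label)
        for cost in ("C1", "C2", "C3"):
            experiments.append((label, cost, [(q, cost) for q in machines]))
    # The one exception: cost structures differ per machine type.
    machines, _ = parse_label("M6-Q2Q3Q4-C")
    mixed = [(q, MIXED_COSTS[q]) for q in machines]
    experiments.append(("M6-Q2Q3Q4-C", "mixed", mixed))
    return experiments
```

Under these assumptions, five base networks with three cost structures each, plus the single mixed-cost network, give the 16 asset networks mentioned at the start of this section.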