Fig 1.
Next-best action-based personalized decision-making in constrained, tailored, sequential and interactive dynamic processes with state-action-response-coupled sequences.
A decision-maker interacts with a client sequentially over t time steps. At each time point i, the client is associated with demographics di and state si. The decision-maker takes an action ai on the client’s state si under policy constraint pi. The client responds to the action with behavior oi, and the action takes effect with reward ri, which reflects the effect of the transition from the current state to the next state. Note that the diagram illustrates only an ideal scenario: one client interacts with one decision-maker, and one action corresponds to one response at each time point. From bottom to top: a rectangle represents a client’s demographics, a light-blue ellipse represents a client’s response, a circle represents a client’s state, a light-red ellipse represents a decision action, and a rounded rectangle represents a policy constraining decision actions.
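As a rough illustration of the sequence described for Fig 1, each time point can be stored as one record holding demographics, state, policy constraint, action, response and reward. The sketch below is an assumption for illustration only: the class name `Step`, the field types, the sample values, and the choice to model the policy constraint as a set of permitted actions are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Step:
    """One time point i of the interaction (hypothetical layout)."""
    demographics: dict        # d_i
    state: str                # s_i
    allowed: FrozenSet[str]   # p_i, modeled here as the permitted actions (assumption)
    action: str               # a_i
    response: str             # o_i
    reward: float             # r_i

    def __post_init__(self) -> None:
        # Reject any action that the policy constraint does not permit.
        if self.action not in self.allowed:
            raise ValueError(f"action {self.action!r} violates the policy constraint")


def total_reward(sequence: List[Step]) -> float:
    """Sum the rewards collected over the whole interaction."""
    return sum(step.reward for step in sequence)


# Made-up example: one client, one action and one response per time point.
episode = [
    Step({"age": 40}, "overdue", frozenset({"sms", "call"}), "sms", "no_reply", 0.0),
    Step({"age": 40}, "overdue", frozenset({"call", "letter"}), "call", "promise_to_pay", 1.0),
]
print(total_reward(episode))
```

Checking the constraint when the record is built keeps invalid state-action pairs out of the sequence altogether, which mirrors the ideal one-action, one-response scenario in the caption.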
Fig 2.
The framework for modeling the next-best action-oriented personalized decision-making.
Ct refers to the representation describing client c at time t, st is the vector of the client’s state representation, at refers to an action selected from the candidate action set and also denotes the vector representation of that action, and the output is the set of recommended next-best actions. The recommender first embeds a client’s demographics, behaviors and current state into a state vector st using the personalized representation module (Fig 3), then feeds st and at into the reward prediction module to evaluate the effectiveness of the action. The k actions in the candidate set with the highest predicted rewards are then recommended as the next-best actions.
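The final selection step described for Fig 2 can be sketched as scoring every candidate action with a reward predictor and keeping the k highest scores. This is only a minimal sketch: `recommend_top_k`, `dot_reward`, the action names and the vectors are hypothetical, and the dot product merely stands in for the paper's reward prediction module.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def recommend_top_k(
    state: Vector,
    candidates: Dict[str, Vector],
    predict_reward: Callable[[Vector, Vector], float],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k candidate actions with the highest predicted reward."""
    scored = ((name, predict_reward(state, vec)) for name, vec in candidates.items())
    # nlargest returns the results sorted from highest to lowest reward.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def dot_reward(state: Vector, action: Vector) -> float:
    """Placeholder reward model (assumption): a plain dot product."""
    return sum(s * a for s, a in zip(state, action))


state_vector = [1.0, 0.5]
candidate_actions = {
    "sms": [0.2, 0.1],
    "call": [0.9, 0.4],
    "letter": [0.5, 0.0],
}
top_two = recommend_top_k(state_vector, candidate_actions, dot_reward, k=2)
print([name for name, _ in top_two])
```

Passing the reward predictor in as a function keeps the ranking logic independent of whichever model produces the scores.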
Fig 3.
A reinforced coupled recurrent network to learn personalized client representation.
Given a client c at current time t with the description Ct, Ot,i refers to the client response at past time i, ai is the decision action assigned to the client, oi represents the vector representation of Ot,i, and ai also denotes the vector representation of the action. The two hidden states correspond to the action and to the client response, respectively. d is the transformed vector corresponding to the client’s relatively stable personal information Dt, simp indicates the learned data-driven implicit features, sexp refers to the transformed domain-driven explicit features, st is the resultant state vector representation for the client c, and FC refers to fully connected networks.
Fig 4.
A coupled recurrent unit (CRU) for modeling state-action-response-coupled long-term dependencies.
Two vectors represent the historical sequences of actions and client responses, respectively. ro and ra are two gates that control the impact of historical responses and actions on their current states. Gates zo and za control the impact of the current response and action states, respectively, on updating the memory of their historical information. ri is an interaction gate that captures the dependence between a decision action and a client response.
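The gate roles listed for Fig 4 can be pictured with a GRU-style update on two coupled states. The caption does not give the actual update equations, so everything below is an assumption: the function `coupled_step`, the scalar states, the single shared weight `w`, and the exact form of each gate are illustrative choices and not the paper's CRU.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def coupled_step(h_a: float, h_o: float, x_a: float, x_o: float, w: float = 0.5):
    """One GRU-style coupled update on scalar states (illustrative only)."""
    # Gates r_a and r_o: how much action/response history enters the candidates.
    r_a = sigmoid(w * (x_a + h_a))
    r_o = sigmoid(w * (x_o + h_o))
    # Interaction gate r_i: how strongly the two sequences influence each other.
    r_i = sigmoid(w * (h_a + h_o))
    # Candidates mix the current input, gated own history and gated partner history.
    cand_a = math.tanh(w * (x_a + r_a * h_a + r_i * h_o))
    cand_o = math.tanh(w * (x_o + r_o * h_o + r_i * h_a))
    # Gates z_a and z_o: how much of each candidate overwrites the stored memory.
    z_a = sigmoid(w * (x_a - h_a))
    z_o = sigmoid(w * (x_o - h_o))
    new_h_a = (1.0 - z_a) * h_a + z_a * cand_a
    new_h_o = (1.0 - z_o) * h_o + z_o * cand_o
    return new_h_a, new_h_o


# Run the unit over a short made-up sequence of (action input, response input) pairs.
h_a, h_o = 0.0, 0.0
for x_a, x_o in [(1.0, -0.5), (0.3, 0.8), (-0.2, 0.1)]:
    h_a, h_o = coupled_step(h_a, h_o, x_a, x_o)
print(h_a, h_o)
```

Because the interaction gate feeds the response state into the action candidate (and the reverse), changing one history changes the other state, which is the coupling the caption refers to.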
Fig 5.
An example of representing three clients by the reinforced coupled recurrent network.
Three debtors with different demographics and past response behaviors to the same decision actions are represented as three vectors.
Fig 6.
Reward prediction for the next-best action on a client’s state.
The reward (rating) of an action is predicted by residual networks given a client’s state.
Table 1.
The distribution of rewards to 10 actions specified by debt collection experts.
Table 2.
Average reward lift for 10 actions recommended by 11 deep models over the reward measured by domain-driven debt collection rules.
Table 3.
Precision lift of 10 actions recommended by 11 deep models over that by domain-driven debt collection rules.
Table 4.
The reward mean squared error (MSE) per action between the reward assigned by the domain-driven debt collection rules and that predicted by 10 deep models.
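The per-action MSE reported in Table 4 can be computed by grouping squared differences by action and averaging within each group. The sketch below assumes a simple record layout of (action, rule reward, model reward) triples; the function name `mse_per_action` and the sample values are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mse_per_action(records: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Mean squared error between rule-based and model rewards, per action."""
    squared_sum = defaultdict(float)
    count = defaultdict(int)
    for action, rule_reward, model_reward in records:
        squared_sum[action] += (rule_reward - model_reward) ** 2
        count[action] += 1
    return {action: squared_sum[action] / count[action] for action in squared_sum}


# Made-up records: (action, reward from rules, reward from a model).
sample = [
    ("sms", 1.0, 0.5),
    ("sms", 0.0, 0.5),
    ("call", 2.0, 2.0),
]
print(mse_per_action(sample))
```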
Fig 7.
CRN convergence w.r.t. loss value on the validation debt collection data.
The X-axis refers to the number of epochs, and the Y-axis refers to the loss value of the CRN objective function (Eq (3)).