machina.algos package

Submodules

machina.algos.airl module

machina.algos.behavior_clone module

machina.algos.ddpg module

This is an implementation of Deep Deterministic Policy Gradient (DDPG). See https://arxiv.org/abs/1509.02971

machina.algos.ddpg.train(traj, pol, targ_pol, qf, targ_qf, optim_pol, optim_qf, epoch, batch_size, tau, gamma)[source]

Train function for Deep Deterministic Policy Gradient.

Parameters:
  • traj (Traj) – Off policy trajectory.
  • pol (Pol) – Policy.
  • targ_pol (Pol) – Target Policy.
  • qf (SAVfunction) – Q function.
  • targ_qf (SAVfunction) – Target Q function.
  • optim_pol (torch.optim.Optimizer) – Optimizer for Policy.
  • optim_qf (torch.optim.Optimizer) – Optimizer for Q function.
  • epoch (int) – Number of iterations.
  • batch_size (int) – Batch size.
  • tau (float) – Target update rate.
  • gamma (float) – Discount factor.
Returns:

result_dict – Dictionary containing loss information.

Return type:

dict
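
Example usage (a minimal sketch, not taken from the machina source: pol, targ_pol, qf, targ_qf and the off-policy traj are assumed to have been built beforehand with machina's policy, Q-function, and trajectory classes and to expose .parameters() like torch modules; the train call itself follows the signature documented above):

    import torch
    from machina.algos import ddpg

    # Optimizers are supplied by the caller; Adam and these learning rates
    # are illustrative choices, not machina defaults.
    optim_pol = torch.optim.Adam(pol.parameters(), lr=1e-4)
    optim_qf = torch.optim.Adam(qf.parameters(), lr=1e-3)

    result_dict = ddpg.train(
        traj, pol, targ_pol, qf, targ_qf,
        optim_pol, optim_qf,
        epoch=100, batch_size=32,  # iterations and minibatch size (illustrative)
        tau=0.01, gamma=0.99,      # target update rate and discount factor (illustrative)
    )
    # result_dict maps loss names to the losses recorded during training.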

machina.algos.diayn module

machina.algos.diayn_sac module

machina.algos.gail module

machina.algos.mpc module

machina.algos.on_pol_teacher_distill module

machina.algos.ppo_clip module

This is an implementation of Proximal Policy Optimization with the clipped surrogate objective, in which the probability ratio in the objective is clipped. See https://arxiv.org/abs/1707.06347

machina.algos.ppo_clip.train(traj, pol, vf, optim_pol, optim_vf, epoch, batch_size, num_epi_per_seq=1, clip_param=0.2, ent_beta=0.001, max_grad_norm=0.5, clip_vfunc=False)[source]

Train function for proximal policy optimization (clip).

Parameters:
  • traj (Traj) – On policy trajectory.
  • pol (Pol) – Policy.
  • vf (SVfunction) – V function.
  • optim_pol (torch.optim.Optimizer) – Optimizer for Policy.
  • optim_vf (torch.optim.Optimizer) – Optimizer for V function.
  • epoch (int) – Number of iterations.
  • batch_size (int) – Batch size.
  • num_epi_per_seq (int) – Number of episodes per sequence (for RNN policies).
  • clip_param (float) – Clipping parameter for the probability ratio in the objective.
  • ent_beta (float) – Entropy coefficient.
  • max_grad_norm (float) – Maximum gradient norm.
  • clip_vfunc (bool) – If True, the V function is also updated with a clipped objective.
Returns:

result_dict – Dictionary containing loss information.

Return type:

dict
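
Example usage (a minimal sketch: pol, vf and an on-policy traj with returns and advantages already computed are assumed to have been built beforehand with machina's policy, V-function, and trajectory utilities; the train call follows the documented signature):

    import torch
    from machina.algos import ppo_clip

    # Illustrative optimizer choices; machina leaves these to the caller.
    optim_pol = torch.optim.Adam(pol.parameters(), lr=3e-4)
    optim_vf = torch.optim.Adam(vf.parameters(), lr=3e-4)

    result_dict = ppo_clip.train(
        traj, pol, vf, optim_pol, optim_vf,
        epoch=10, batch_size=64,
        clip_param=0.2, ent_beta=0.001,
        max_grad_norm=0.5, clip_vfunc=False,
    )
    # result_dict maps loss names to the losses recorded during training.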

machina.algos.ppo_clip.update_pol(pol, optim_pol, batch, clip_param, ent_beta, max_grad_norm)[source]

Update function for Policy.

Parameters:
  • pol (Pol) – Policy.
  • optim_pol (torch.optim.Optimizer) – Optimizer for Policy.
  • batch (dict) – Batch of trajectory data.
  • clip_param (float) – Clipping parameter for the probability ratio in the objective.
  • ent_beta (float) – Entropy coefficient.
  • max_grad_norm (float) – Maximum gradient norm.
Returns:

pol_loss – Value of loss function.

Return type:

ndarray

machina.algos.ppo_clip.update_vf(vf, optim_vf, batch, clip_param, clip, max_grad_norm)[source]

Update function for V function.

Parameters:
  • vf (SVfunction) – V function.
  • optim_vf (torch.optim.Optimizer) – Optimizer for V function.
  • batch (dict) – Batch of trajectory data.
  • clip_param (float) – Clipping parameter of the objective function.
  • clip (bool) – If True, the V function is updated with a clipped objective.
  • max_grad_norm (float) – Maximum gradient norm.
Returns:

vf_loss – Value of loss function.

Return type:

ndarray

machina.algos.ppo_kl module

This is an implementation of Proximal Policy Optimization with an adaptive KL divergence penalty. See https://arxiv.org/abs/1707.06347

machina.algos.ppo_kl.train(traj, pol, vf, kl_beta, kl_targ, optim_pol, optim_vf, epoch, batch_size, max_grad_norm, num_epi_per_seq=1)[source]

Train function for proximal policy optimization (kl).

Parameters:
  • traj (Traj) – On policy trajectory.
  • pol (Pol) – Policy.
  • vf (SVfunction) – V function.
  • kl_beta (float) – KL divergence coefficient.
  • kl_targ (float) – Target of KL divergence.
  • optim_pol (torch.optim.Optimizer) – Optimizer for Policy.
  • optim_vf (torch.optim.Optimizer) – Optimizer for V function.
  • epoch (int) – Number of iterations.
  • batch_size (int) – Batch size.
  • max_grad_norm (float) – Maximum gradient norm.
  • num_epi_per_seq (int) – Number of episodes per sequence (for RNN policies).
Returns:

result_dict – Dictionary containing loss information.

Return type:

dict
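
Example usage (a minimal sketch: pol, vf and an on-policy traj are assumed to have been built beforehand with machina's utilities; the train call follows the documented signature):

    import torch
    from machina.algos import ppo_kl

    # Illustrative optimizer choices.
    optim_pol = torch.optim.Adam(pol.parameters(), lr=3e-4)
    optim_vf = torch.optim.Adam(vf.parameters(), lr=3e-4)

    result_dict = ppo_kl.train(
        traj, pol, vf,
        kl_beta=1.0, kl_targ=0.01,   # penalty coefficient and KL target (illustrative)
        optim_pol=optim_pol, optim_vf=optim_vf,
        epoch=10, batch_size=64, max_grad_norm=0.5,
    )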

machina.algos.ppo_kl.update_pol(pol, optim_pol, batch, kl_beta, max_grad_norm)[source]
machina.algos.ppo_kl.update_vf(vf, optim_vf, batch)[source]

machina.algos.prioritized_ddpg module

machina.algos.qtopt module

This is an implementation of QT-Opt. See https://arxiv.org/abs/1806.10293

machina.algos.qtopt.train(traj, qf, lagged_qf, targ_qf1, targ_qf2, optim_qf, epoch, batch_size, tau=0.9999, gamma=0.9, loss_type='mse')[source]

Train function for QT-Opt.

Parameters:
  • traj (Traj) – Off policy trajectory.
  • qf (SAVfunction) – Q function.
  • lagged_qf (SAVfunction) – Lagged Q function.
  • targ_qf1 (CEMSAVfunction) – Target Q function.
  • targ_qf2 (CEMSAVfunction) – Lagged Target Q function.
  • optim_qf (torch.optim.Optimizer) – Optimizer for Q function.
  • epoch (int) – Number of iterations.
  • batch_size (int) – Batch size.
  • tau (float) – Target update rate.
  • gamma (float) – Discount factor.
  • loss_type (string) – Type of Bellman loss.
Returns:

result_dict – Dictionary containing loss information.

Return type:

dict
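
Example usage (a minimal sketch: qf, lagged_qf, targ_qf1, targ_qf2 and the off-policy traj are assumed to have been built beforehand with machina's Q-function and trajectory classes; the train call follows the documented signature):

    import torch
    from machina.algos import qtopt

    # Illustrative optimizer choice.
    optim_qf = torch.optim.Adam(qf.parameters(), lr=3e-4)

    result_dict = qtopt.train(
        traj, qf, lagged_qf, targ_qf1, targ_qf2, optim_qf,
        epoch=100, batch_size=32,
        tau=0.9999, gamma=0.9, loss_type='mse',
    )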

machina.algos.r2d2_sac module

machina.algos.sac module

This is an implementation of Soft Actor Critic. See https://arxiv.org/abs/1801.01290

machina.algos.sac.train(traj, pol, qf, targ_qf, log_alpha, optim_pol, optim_qf, optim_alpha, epoch, batch_size, tau, gamma, sampling)[source]

Train function for soft actor critic.

Parameters:
  • traj (Traj) – Off policy trajectory.
  • pol (Pol) – Policy.
  • qf (SAVfunction) – Q function.
  • targ_qf (SAVfunction) – Target Q function.
  • log_alpha (torch.Tensor) – Temperature parameter of entropy.
  • optim_pol (torch.optim.Optimizer) – Optimizer for Policy.
  • optim_qf (torch.optim.Optimizer) – Optimizer for Q function.
  • optim_alpha (torch.optim.Optimizer) – Optimizer for alpha.
  • epoch (int) – Number of iterations.
  • batch_size (int) – Batch size.
  • tau (float) – Target update rate.
  • gamma (float) – Discount factor.
  • sampling (int) – Number of samples used when approximating the expectation.
Returns:

result_dict – Dictionary containing loss information.

Return type:

dict
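
Example usage (a minimal sketch: pol, qf, targ_qf and the off-policy traj are assumed to have been built beforehand; the log_alpha tensor and the optimizers below are illustrative, and the train call follows the documented signature):

    import torch
    from machina.algos import sac

    # Learnable entropy temperature, stored in log space.
    log_alpha = torch.zeros(1, requires_grad=True)

    optim_pol = torch.optim.Adam(pol.parameters(), lr=3e-4)
    optim_qf = torch.optim.Adam(qf.parameters(), lr=3e-4)
    optim_alpha = torch.optim.Adam([log_alpha], lr=3e-4)

    result_dict = sac.train(
        traj, pol, qf, targ_qf, log_alpha,
        optim_pol, optim_qf, optim_alpha,
        epoch=100, batch_size=32,
        tau=0.005, gamma=0.99,
        sampling=1,   # samples used to approximate the expectation
    )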

machina.algos.svg module

This is an implementation of Stochastic Value Gradient. See https://arxiv.org/abs/1510.09142

machina.algos.svg.train(traj, pol, targ_pol, qf, targ_qf, optim_pol, optim_qf, epoch, batch_size, tau, gamma, sampling)[source]

Train function for Stochastic Value Gradient.

Parameters:
  • traj (Traj) – Off policy trajectory.
  • pol (Pol) – Policy.
  • targ_pol (Pol) – Target Policy.
  • qf (SAVfunction) – Q function.
  • targ_qf (SAVfunction) – Target Q function.
  • optim_pol (torch.optim.Optimizer) – Optimizer for Policy.
  • optim_qf (torch.optim.Optimizer) – Optimizer for Q function.
  • epoch (int) – Number of iterations.
  • batch_size (int) – Batch size.
  • tau (float) – Target update rate.
  • gamma (float) – Discount factor.
  • sampling (int) – Number of samples used when approximating the expectation.
Returns:

result_dict – Dictionary containing loss information.

Return type:

dict
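
Example usage (a minimal sketch analogous to the DDPG example: pol, targ_pol, qf, targ_qf and the off-policy traj are assumed to have been built beforehand; the train call follows the documented signature):

    import torch
    from machina.algos import svg

    # Illustrative optimizer choices.
    optim_pol = torch.optim.Adam(pol.parameters(), lr=1e-4)
    optim_qf = torch.optim.Adam(qf.parameters(), lr=1e-3)

    result_dict = svg.train(
        traj, pol, targ_pol, qf, targ_qf,
        optim_pol, optim_qf,
        epoch=100, batch_size=32,
        tau=0.01, gamma=0.99, sampling=1,
    )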

machina.algos.trpo module

This is an implementation of Trust Region Policy Optimization. See https://arxiv.org/abs/1502.05477

machina.algos.trpo.conjugate_gradients(Avp, b, nsteps, residual_tol=1e-10)[source]

Computes an approximate solution of Ax = b by the conjugate gradient method, where Avp(v) returns the matrix-vector product Av.
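
Example (a minimal sketch; it assumes, as in common TRPO implementations, that the function returns the approximate solution x as a torch tensor):

    import torch
    from machina.algos import trpo

    # A small symmetric positive-definite system A x = b.
    A = torch.tensor([[4.0, 1.0], [1.0, 3.0]])
    b = torch.tensor([1.0, 2.0])

    def Avp(v):
        # Matrix-vector product A @ v; in TRPO this would be a
        # Fisher-vector (Hessian-vector) product instead.
        return A @ v

    x = trpo.conjugate_gradients(Avp, b, nsteps=10)
    print(torch.allclose(A @ x, b, atol=1e-4))  # x approximately solves A x = b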

machina.algos.trpo.linesearch(pol, batch, f, x, fullstep, expected_improve_rate, max_backtracks=10, accept_ratio=0.1)[source]
machina.algos.trpo.make_kl(pol, batch)[source]
machina.algos.trpo.train(traj, pol, vf, optim_vf, epoch=5, batch_size=64, num_epi_per_seq=1, max_kl=0.01, num_cg=10, damping=0.1)[source]

Train function for trust region policy optimization.

Parameters:
  • traj (Traj) – On policy trajectory.
  • pol (Pol) – Policy.
  • vf (SVfunction) – V function.
  • optim_vf (torch.optim.Optimizer) – Optimizer for V function.
  • epoch (int) – Number of iterations.
  • batch_size (int) – Batch size.
  • num_epi_per_seq (int) – Number of episodes per sequence (for RNN policies).
  • max_kl (float) – Limit of KL divergence.
  • num_cg (int) – Number of conjugate gradient iterations.
  • damping (float) – Damping parameter for Hessian Vector Product.
Returns:

result_dict – Dictionary containing loss information.

Return type:

dict
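
Example usage (a minimal sketch: pol, vf and an on-policy traj are assumed to have been built beforehand; note that no policy optimizer is passed, since the policy is updated by a natural-gradient step inside train):

    import torch
    from machina.algos import trpo

    # Illustrative optimizer choice for the V function only.
    optim_vf = torch.optim.Adam(vf.parameters(), lr=3e-4)

    result_dict = trpo.train(
        traj, pol, vf, optim_vf,
        epoch=5, batch_size=64,
        max_kl=0.01, num_cg=10, damping=0.1,
    )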

machina.algos.trpo.update_pol(pol, batch, make_kl=<function make_kl>, max_kl=0.01, damping=0.1, num_cg=10)[source]
machina.algos.trpo.update_vf(vf, optim_vf, batch)[source]

machina.algos.vpg module

This is an implementation of Vanilla Policy Gradient.

machina.algos.vpg.train(traj, pol, vf, optim_pol, optim_vf, epoch, batch_size, large_batch)[source]

Train function for vanilla policy gradient.

Parameters:
  • traj (Traj) – On policy trajectory.
  • pol (Pol) – Policy.
  • vf (SVfunction) – V function.
  • optim_pol (torch.optim.Optimizer) – Optimizer for Policy.
  • optim_vf (torch.optim.Optimizer) – Optimizer for V function.
  • epoch (int) – Number of iterations.
  • batch_size (int) – Batch size.
  • large_batch (bool) – If True, the whole trajectory is provided as a single batch.
Returns:

result_dict – Dictionary containing loss information.

Return type:

dict
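
Example usage (a minimal sketch: pol, vf and an on-policy traj are assumed to have been built beforehand; the train call follows the documented signature):

    import torch
    from machina.algos import vpg

    # Illustrative optimizer choices.
    optim_pol = torch.optim.Adam(pol.parameters(), lr=3e-4)
    optim_vf = torch.optim.Adam(vf.parameters(), lr=3e-4)

    result_dict = vpg.train(
        traj, pol, vf, optim_pol, optim_vf,
        epoch=10, batch_size=64,
        large_batch=False,   # if True, the whole trajectory is used as one batch
    )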

machina.algos.vpg.update_pol(pol, optim_pol, batch)[source]
machina.algos.vpg.update_vf(vf, optim_vf, batch)[source]