Introduction to the Taxi-v3 Environment
There are four designated pick-up and drop-off locations (Red, Green, Yellow and Blue) in a 5x5 grid world. The taxi starts at a random square and the passenger is at one of the designated locations. The goal is to move the taxi to the passenger's location, pick up the passenger, drive to the passenger's desired destination, and drop the passenger off. Once the passenger is dropped off, the episode ends. The player receives a positive reward for successfully delivering the passenger to the correct location, and negative rewards for incorrect pick-up or drop-off attempts and for every step that does not yield another reward.
Action Space
The action shape is (1,) with values in the range {0, 5}, indicating which direction to move the taxi or whether to pick up or drop off the passenger.
0: Move south (down)
1: Move north (up)
2: Move east (right)
3: Move west (left)
4: Pick up the passenger
5: Drop off the passenger
Observation Space
There are 500 discrete states, since there are 25 taxi positions, 5 possible passenger locations (including the case when the passenger is in the taxi), and 4 destination locations.
Destinations on the map are indicated by the first letter of their color.
Passenger locations:
0: Red
1: Green
2: Yellow
3: Blue
4: In the taxi
Destinations:
0: Red
1: Green
2: Yellow
3: Blue
The observation is returned as an int() that encodes the corresponding state, calculated by ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination. Note that there are 400 states that can actually be reached during an episode. The missing states correspond to situations in which the passenger is already at their destination, since that typically signals the end of an episode. Four additional states can be observed right after a successful episode, when both the passenger and the taxi are at the destination. This gives 404 reachable discrete states in total.
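A minimal sketch of this encoding and its inverse, written directly from the formula above (the encode_state/decode_state helpers here are illustrative and not part of the Gymnasium API):

def encode_state(taxi_row, taxi_col, passenger_location, destination):
    # ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination

def decode_state(state):
    # Invert the encoding, peeling off one factor at a time
    destination = state % 4
    state //= 4
    passenger_location = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, passenger_location, destination

# Example: taxi at row 3, column 1, passenger at Yellow (2), destination Blue (3)
s = encode_state(3, 1, 2, 3)
print(s, decode_state(s))  # 331 (3, 1, 2, 3)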
Initial State
The initial state is sampled uniformly from the possible states in which the passenger is neither at their destination nor inside the taxi. There are 300 possible initial states: 25 taxi positions, 4 passenger locations (excluding inside the taxi), and 3 destinations (excluding the passenger's current location).
Rewards
- -1 per step, unless another reward is triggered.
- +20 for successfully delivering the passenger (a worked return example follows this list).
- -10 for executing "pickup" or "dropoff" actions illegally.
- An action that results in a noop, such as moving into a wall, incurs the time step penalty. Noops can be avoided by sampling from the action_mask returned in info.
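As a quick sanity check on how these rewards combine (a hypothetical episode, not taken from the source): if the agent delivers the passenger on its 13th step without any illegal pickup or drop-off attempts along the way, the return is 12 × (−1) + 20 = +8.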
Episode End
The episode ends when either of the following happens:
- Termination: the taxi drops off the passenger.
- Truncation (when using the time_limit wrapper): the episode length reaches 200.
Info
step() and reset() return a dict with the following keys:
- p: the transition probability for the state.
- action_mask: indicates whether an action will cause a transition to a new state.
Because the taxi environment is not stochastic, the transition probability is always 1.0. Implementing a transition probability in line with Dietterich's paper ("The fickle taxi task") is a TODO. In some cases, taking an action has no effect on the state of the episode. As of v0.25.0, info["action_mask"] contains an np.ndarray with one entry per action, specifying whether that action will change the state.
To sample an action that changes the state, use action = env.action_space.sample(info["action_mask"]), or, with a Q-value based algorithm, action = np.argmax(q_values[obs, np.where(info["action_mask"] == 1)[0]]).
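A minimal, self-contained sketch of the masked sampling described above (this snippet only illustrates the action_mask usage; it is separate from the training code that follows):

import gymnasium as gym

env = gym.make("Taxi-v3")
state, info = env.reset(seed=0)

# Sample only among actions that would actually change the state
action = env.action_space.sample(info["action_mask"])
new_state, reward, terminated, truncated, info = env.step(action)
print(action, reward, info["action_mask"])
env.close()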
import os
import random
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
# Create the environment
env = gym.make("Taxi-v3", render_mode="rgb_array")

state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")
There are 500 possible states
There are 6 possible actions
# Initialize the Q-table
# Create our Q table with state_size rows and action_size columns (500x6)
def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space, action_space))
    return Qtable

Qtable_taxi = initialize_q_table(state_space, action_space)
print(Qtable_taxi)
print("Q-table shape: ", Qtable_taxi.shape)
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
Q-table shape: (500, 6)
Define the hyperparameters. Do not modify EVAL_SEED: the eval_seed array fixes the seeds so that your agent is evaluated from the same taxi starting positions.
# Training parameters
n_training_episodes = 25000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# DO NOT MODIFY EVAL_SEED
# Evaluation seed, this ensures that all classmates' agents are trained on the same taxi starting positions
# Each seed has a specific starting state
eval_seed = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,161,131,184,51,170,12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148]

# Environment parameters
env_id = "Taxi-v3"  # Name of the environment
max_steps = 99      # Max steps per episode
gamma = 0.95        # Discounting rate

# Exploration parameters
max_epsilon = 1.0   # Exploration probability at start
min_epsilon = 0.05  # Minimum exploration probability
decay_rate = 0.005  # Exponential decay rate for exploration prob
Cosine annealing decay schedule for $\epsilon$:

$$\epsilon_t = \epsilon_{min} + \frac{1}{2}(\epsilon_{max} - \epsilon_{min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)$$
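A small sketch of this schedule, matching how epsilon is computed inside the training loop below (here T is n_training_episodes and t is the current episode index):

def cosine_epsilon(t, T, min_eps=0.05, max_eps=1.0):
    # epsilon_t = min + 0.5 * (max - min) * (1 + cos(t * pi / T))
    return min_eps + 0.5 * (max_eps - min_eps) * (1 + np.cos(t * np.pi / T))

print(cosine_epsilon(0, 25000))      # 1.0 at the start (pure exploration)
print(cosine_epsilon(12500, 25000))  # ~0.525 halfway through
print(cosine_epsilon(24999, 25000))  # ~0.05 near the end (mostly exploitation)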
Training the Agent
def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest state-action value
    action = np.argmax(Qtable[state][:])
    return action

def epsilon_greedy_policy(Qtable, state, epsilon):
    # Randomly generate a number between 0 and 1
    random_num = random.uniform(0, 1)
    # If random_num > epsilon --> exploitation
    if random_num > epsilon:
        # Take the action with the highest value given a state
        action = greedy_policy(Qtable, state)
    # else --> exploration
    else:
        action = env.action_space.sample()
    return action
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    for episode in tqdm(range(n_training_episodes)):
        # Reduce epsilon (because we need less and less exploration)
        # epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * 0.5 * (1 + np.cos(episode * np.pi / n_training_episodes))

        # Reset the environment
        state, info = env.reset()
        step = 0
        terminated = False
        truncated = False

        # repeat
        for step in range(max_steps):
            # Choose the action At using epsilon greedy policy
            action = epsilon_greedy_policy(Qtable, state, epsilon)

            # Take action At and observe Rt+1 and St+1
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, terminated, truncated, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Qtable[state][action] = Qtable[state][action] + learning_rate * (
                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
            )

            # If terminated or truncated, finish the episode
            if terminated or truncated:
                break

            # Our next state is the new state
            state = new_state
    return Qtable
Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
print("Qtable_taxi:\n",Qtable_taxi)
Qtable_taxi:
 [[ 0.          0.          0.          0.          0.          0.        ]
 [ 2.75200369  3.94947757  2.75200369  3.94947757  5.20997639 -5.05052243]
 [ 7.93349184  9.40367562  7.93349184  9.40367562 10.9512375   0.40367562]
 ...
 [10.9512375  12.58025    10.9512375   9.40367562  1.9512375   1.9512375 ]
 [ 5.20997639  6.53681725  5.20997639  6.53681725 -3.79002361 -3.79002361]
 [16.1        14.295      16.1        18.          7.1         7.1       ]]
Evaluating the Taxi Agent
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    """
    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and std of the reward.
    :param env: The evaluation environment
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param Q: The Q-table
    :param seed: The evaluation seed array (for taxi-v3)
    """
    episode_rewards = []
    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed=seed[episode])
        else:
            state, info = env.reset()
        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0

        for step in range(max_steps):
            # Take the action (index) that has the maximum expected future reward given that state
            action = greedy_policy(Q, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep += reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward
# Evaluate Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_taxi, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
Mean_reward=7.56 +/- 2.71
Visualizing the Taxi Agent
env = gym.wrappers.RecordVideo(env, video_folder="./Taxi-v3-QL", disable_logger=True, fps=30)
state, info = env.reset()
for step in range(max_steps):
    action = greedy_policy(Qtable_taxi, state)
    state, reward, terminated, truncated, info = env.step(action)
    if terminated:
        break
env.close()
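The recorded MP4 files are written to the ./Taxi-v3-QL folder. As an optional extra (a sketch that is not part of the original notebook), the matplotlib import from earlier can be used to display a single rendered frame inline:

frame_env = gym.make("Taxi-v3", render_mode="rgb_array")
state, info = frame_env.reset()
plt.imshow(frame_env.render())  # render() returns an RGB array in this render mode
plt.axis("off")
plt.show()
frame_env.close()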