Introduction to the Taxi-v3 Environment
There are four designated pick-up and drop-off locations (Red, Green, Yellow and Blue) in a 5x5 grid world. The taxi starts at a random square and the passenger is at one of the designated locations. The goal is to move the taxi to the passenger's location, pick up the passenger, drive to the passenger's desired destination, and drop the passenger off. Once the passenger is dropped off, the episode ends. The player receives a positive reward for successfully delivering the passenger to the correct location, and negative rewards for incorrect pick-up or drop-off attempts and for every step that does not yield another reward.
Action Space
The action shape is (1,) with values in the range {0, 5}, indicating which direction to move the taxi or whether to pick up or drop off the passenger.
0: Move south (down)
1: Move north (up)
2: Move east (right)
3: Move west (left)
4: Pick up the passenger
5: Drop off the passenger
Observation Space
There are 500 discrete states, since there are 25 taxi positions, 5 possible passenger locations (including the case when the passenger is in the taxi), and 4 destination locations.
Destinations on the map are indicated by the first letter of their color.
Passenger locations:
0: Red
1: Green
2: Yellow
3: Blue
4: In the taxi
Destinations:
0: Red
1: Green
2: Yellow
3: Blue
The observation is returned as an int() that encodes the corresponding state, calculated by ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination. Note that there are 400 states that can actually be reached during an episode. The missing states correspond to situations in which the passenger is already at their destination, since that typically signals the end of an episode. Four additional states can be observed right after a successful episode, when both the passenger and the taxi are at the destination. This gives 404 reachable discrete states in total.
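A minimal sketch of this encoding and its inverse, written directly from the formula above (the encode_state/decode_state helpers here are illustrative and not part of the Gymnasium API):

def encode_state(taxi_row, taxi_col, passenger_location, destination):
    # ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination

def decode_state(state):
    # Invert the encoding, peeling off one factor at a time
    destination = state % 4
    state //= 4
    passenger_location = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, passenger_location, destination

# Example: taxi at row 3, column 1, passenger at Yellow (2), destination Blue (3)
s = encode_state(3, 1, 2, 3)
print(s, decode_state(s))  # 331 (3, 1, 2, 3)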
Initial State
The initial state is sampled uniformly from the possible states in which the passenger is neither at their destination nor inside the taxi. There are 300 possible initial states: 25 taxi positions, 4 passenger locations (excluding inside the taxi), and 3 destinations (excluding the passenger's current location).
Rewards
- -1 per step, unless another reward is triggered.
- +20 for successfully delivering the passenger (a worked return example follows this list).
- -10 for executing "pickup" or "dropoff" actions illegally.
- An action that results in a noop, such as moving into a wall, incurs the time step penalty. Noops can be avoided by sampling from the action_mask returned in info.
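As a quick sanity check on how these rewards combine (a hypothetical episode, not taken from the source): if the agent delivers the passenger on its 13th step without any illegal pickup or drop-off attempts along the way, the return is 12 × (−1) + 20 = +8.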
Episode End
The episode ends when either of the following happens:
- Termination: the taxi drops off the passenger.
- Truncation (when using the time_limit wrapper): the episode length reaches 200.
Info
step() and reset() return a dict with the following keys:
- p: the transition probability for the state.
- action_mask: indicates whether an action will cause a transition to a new state.
Because the taxi environment is not stochastic, the transition probability is always 1.0. Implementing a transition probability in line with Dietterich's paper ("The fickle taxi task") is a TODO. In some cases, taking an action has no effect on the state of the episode. As of v0.25.0, info["action_mask"] contains an np.ndarray with one entry per action, specifying whether that action will change the state.
To sample an action that changes the state, use action = env.action_space.sample(info["action_mask"]), or, with a Q-value based algorithm, action = np.argmax(q_values[obs, np.where(info["action_mask"] == 1)[0]]).
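A minimal, self-contained sketch of the masked sampling described above (this snippet only illustrates the action_mask usage; it is separate from the training code that follows):

import gymnasium as gym

env = gym.make("Taxi-v3")
state, info = env.reset(seed=0)

# Sample only among actions that would actually change the state
action = env.action_space.sample(info["action_mask"])
new_state, reward, terminated, truncated, info = env.step(action)
print(action, reward, info["action_mask"])
env.close()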
import os
import random
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
# Create the environment
env = gym.make("Taxi-v3", render_mode="rgb_array")

state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")
There are 500 possible states
There are 6 possible actions
# Initialize the Q-table
# Create our Q table with state_size rows and action_size columns (500x6)
def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space, action_space))
    return Qtable

Qtable_taxi = initialize_q_table(state_space, action_space)
print(Qtable_taxi)
print("Q-table shape: ", Qtable_taxi.shape)
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
Q-table shape: (500, 6)
Define the hyperparameters. Do not modify EVAL_SEED: the eval_seed array fixes the seeds so that your agent is evaluated from the same taxi starting positions.
# Training parameters
n_training_episodes = 25000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# DO NOT MODIFY EVAL_SEED
# Evaluation seed, this ensures that all classmates' agents are trained on the same taxi starting positions
# Each seed has a specific starting state
eval_seed = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,161,131,184,51,170,12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148]

# Environment parameters
env_id = "Taxi-v3"  # Name of the environment
max_steps = 99      # Max steps per episode
gamma = 0.95        # Discounting rate

# Exploration parameters
max_epsilon = 1.0   # Exploration probability at start
min_epsilon = 0.05  # Minimum exploration probability
decay_rate = 0.005  # Exponential decay rate for exploration prob
Cosine annealing decay schedule for $\epsilon$:

$$\epsilon_t = \epsilon_{min} + \frac{1}{2}(\epsilon_{max} - \epsilon_{min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)$$
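A small sketch of this schedule, matching how epsilon is computed inside the training loop below (here T is n_training_episodes and t is the current episode index):

def cosine_epsilon(t, T, min_eps=0.05, max_eps=1.0):
    # epsilon_t = min + 0.5 * (max - min) * (1 + cos(t * pi / T))
    return min_eps + 0.5 * (max_eps - min_eps) * (1 + np.cos(t * np.pi / T))

print(cosine_epsilon(0, 25000))      # 1.0 at the start (pure exploration)
print(cosine_epsilon(12500, 25000))  # ~0.525 halfway through
print(cosine_epsilon(24999, 25000))  # ~0.05 near the end (mostly exploitation)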
Training the Agent
def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest state-action value
    action = np.argmax(Qtable[state][:])
    return action

def epsilon_greedy_policy(Qtable, state, epsilon):
    # Randomly generate a number between 0 and 1
    random_num = random.uniform(0, 1)
    # If random_num > epsilon --> exploitation
    if random_num > epsilon:
        # Take the action with the highest value given a state
        action = greedy_policy(Qtable, state)
    # else --> exploration
    else:
        action = env.action_space.sample()
    return action
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    for episode in tqdm(range(n_training_episodes)):
        # Reduce epsilon (because we need less and less exploration)
        # epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * 0.5 * (1 + np.cos(episode * np.pi / n_training_episodes))

        # Reset the environment
        state, info = env.reset()
        step = 0
        terminated = False
        truncated = False

        # repeat
        for step in range(max_steps):
            # Choose the action At using epsilon greedy policy
            action = epsilon_greedy_policy(Qtable, state, epsilon)

            # Take action At and observe Rt+1 and St+1
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, terminated, truncated, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Qtable[state][action] = Qtable[state][action] + learning_rate * (
                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
            )

            # If terminated or truncated, finish the episode
            if terminated or truncated:
                break

            # Our next state is the new state
            state = new_state
    return Qtable
Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
print("Qtable_taxi:\n",Qtable_taxi)
Qtable_taxi:
 [[ 0.          0.          0.          0.          0.          0.        ]
 [ 2.75200369  3.94947757  2.75200369  3.94947757  5.20997639 -5.05052243]
 [ 7.93349184  9.40367562  7.93349184  9.40367562 10.9512375   0.40367562]
 ...
 [10.9512375  12.58025    10.9512375   9.40367562  1.9512375   1.9512375 ]
 [ 5.20997639  6.53681725  5.20997639  6.53681725 -3.79002361 -3.79002361]
 [16.1        14.295      16.1        18.          7.1         7.1       ]]
Evaluating the Taxi Agent
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    """
    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and std of the reward.
    :param env: The evaluation environment
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param Q: The Q-table
    :param seed: The evaluation seed array (for taxi-v3)
    """
    episode_rewards = []
    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed=seed[episode])
        else:
            state, info = env.reset()
        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0

        for step in range(max_steps):
            # Take the action (index) that has the maximum expected future reward given that state
            action = greedy_policy(Q, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep += reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward
# Evaluate Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_taxi, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
Mean_reward=7.56 +/- 2.71
Visualizing the Taxi Agent
env = gym.wrappers.RecordVideo(env, video_folder="./Taxi-v3-QL", disable_logger=True, fps=30)
state, info = env.reset()
for step in range(max_steps):
    action = greedy_policy(Qtable_taxi, state)
    state, reward, terminated, truncated, info = env.step(action)
    if terminated:
        break
env.close()
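The recorded MP4 files are written to the ./Taxi-v3-QL folder. As an optional extra (a sketch that is not part of the original notebook), the matplotlib import from earlier can be used to display a single rendered frame inline:

frame_env = gym.make("Taxi-v3", render_mode="rgb_array")
state, info = frame_env.reset()
plt.imshow(frame_env.render())  # render() returns an RGB array in this render mode
plt.axis("off")
plt.show()
frame_env.close()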