Real-time dispatch strategy for microgrid considering source-load uncertainty: a tailored TD3 reinforcement learning approach

Shenpeng Xiang a, Mohan Lin a, Zhe Chen b,*, Pingliang Zeng a,*, Xiangjin Wang b, Diyang Gong b

a Department of Automation, Hangzhou Dianzi University, Hangzhou 310018, PR China

b Electric Power Research Institute of State Grid Zhejiang Electric Power Co., Ltd., Hangzhou 310014, PR China

Abstract

The integration of large-scale distributed new energy resources has heightened source–load uncertainty. As energy prosumers, microgrids urgently require enhanced real-time regulation capabilities over controllable resources in uncertain environments, rendering real-time, rapid decision-making a critical issue. This paper proposes a tailored twin delayed deep deterministic policy gradient (TD3) reinforcement learning algorithm that explicitly accounts for source–load uncertainty. First, following an expert-experience-based methodology, Gaussian process regression was implemented using the radial basis function covariance with historical source and load data. The parameters were adaptively adjusted by maximum likelihood estimation to generate the expected curves of demand and wind and solar power generation, along with their 95% confidence regions, which were treated as representative uncertainty scenarios. Second, the traditional scheduling model was transformed into a deep reinforcement learning (DRL) environment through a Markov process. To minimize the total operational cost of the microgrid, the tailored TD3 algorithm was applied to formulate rapid intraday scheduling decisions. Finally, simulations were conducted using real historical data from an actual region in Zhejiang Province, China, to verify the efficacy of the proposed method. The results demonstrate the potential of the algorithm for achieving economic scheduling of microgrids.

Keywords: Microgrid; Deep reinforcement learning; Tailored TD3 algorithm; Intraday real-time scheduling; Gaussian process regression

0 Introduction

With the rapid growth of distributed renewable energy generation in microgrids and the rising number of electric vehicles (EVs), source–load uncertainty in microgrids has been further exacerbated. The challenge of ensuring users' electricity supply while maintaining the stable operation of microgrids has emerged as a critical research focus, attracting widespread scholarly attention [1–3].

At present, optimal microgrid operation can be achieved through methods such as empirical, probabilistic model-based, and classical scenario-based methods [4]. In particular, the empirical method selects typical days based on historical expert experience [5,6]. The probabilistic method generates source–load scenarios by combining the statistical experience or probability distributions of historical data with different sampling techniques [7,8]. Meanwhile, the classical scenario-based method is rooted in representative scenario generation and reduction techniques.

In the formulation of microgrid operation strategies, traditional optimization constructs mathematical models based on practical engineering practices, simplifies these models through mathematical techniques, and ultimately solves the problems using optimization algorithms. These approaches include mixed-integer programming, nonlinear programming, and dynamic programming, while the solution methods encompass Lagrangian relaxation, the branch-and-bound method, Newton's method, and intelligent algorithms. Ref. [13] established a bi-level optimization model to minimize the carbon emissions and operational costs of the system. Ref. [14] developed an optimal scheduling model considering carbon capture and power-to-gas technologies. However, the aforementioned methods suffer from slow convergence owing to the dimensionality of complex systems and from convergence failure in scenarios with real-time requirements [15].

Ref. [28] designed an energy management system (EMS) for hybrid electric vehicles (HEVs), utilizing the twin delayed DDPG (TD3) algorithm to improve agent training efficiency and reduce the fuel consumption of HEVs. Ref. [29] utilized the TD3 algorithm to address power system operation safety problems and validated the method's effectiveness and applicability in actual power system operation scenarios. The training results revealed that the algorithm still exhibited slow convergence induced by the random sampling of data and required a large number of iterations. Ref. [30] proposed an energy management strategy based on the TD3 algorithm with prioritized experience replay (PER-TD3); by utilizing the prioritized experience sampling method, the training process of TD3 was accelerated and superior economic performance was attained. Ref. [31] proposed a sum-tree (ST) data structure-based TD3 algorithm to address the low-carbon economic dispatch of an IES.

Building on the existing TD3 algorithm, this study uses the ST data structure [32] to store and sample historical experience data and implements prioritized experience replay (PER), thereby improving the training efficiency and performance of the TD3 algorithm. First, a Markov decision process (MDP) model was constructed for the intraday real-time scheduling scenarios of microgrids. Subsequently, typical scenarios were generated using Gaussian process regression (GPR), and a decision-making interaction mechanism was established to train the agent's decision-making capability. During training, priorities were assigned based on the update value of the data. Next, by leveraging both the data and their priorities stored in the ST data structure, a hierarchical sampling method was employed to ensure the coverage of samples with "high," "medium," and "low" priorities, thereby enhancing the training efficiency. Finally, simulations were conducted to validate the effectiveness of the proposed method in microgrid intraday scheduling. Comparisons with different algorithms further demonstrated improvements in training efficiency and reductions in scheduling costs.

The main contributions of this study are summarized as follows:

1) To address microgrid generation-load uncertainties across different time scales, a GPR model based on the Bayesian framework is proposed. This model constructs confidence intervals using the mean and variance of historical generation-load data, enabling the generation of scenarios with the required degree of uncertainty.

2) Considering the slow solution speed of the traditional mixed-integer linear programming (MILP) method, this paper proposes an intraday real-time microgrid scheduling method based on DRL to rapidly generate unit dispatch schemes. Its advantage lies in the fact that the agent learns from a large number of historical generation-load scenarios during offline training, generating near-optimal strategies for real-time scheduling.

3) An improved PER-TD3 algorithm integrating an ST data structure is proposed. While reducing the update complexity, the algorithm parameters were adjusted through sensitivity analysis and ablation experiments. After adjustment, the algorithm fully explores and exploits high-value experience data, significantly improving the agent's training efficiency and stability.

1 Scheduling model of a microgrid considering source–load uncertainties

1.1 Source–load uncertainty scenario generation method

GPR, which models patterns in historical data through a statistical approach, excels in handling nonlinear and high-dimensional problems owing to its inherent uncertainty quantification capability.


The GPR can be expressed as

$$f(\boldsymbol{x}) \sim \mathcal{GP}\big(m(\boldsymbol{x}),\ k(\boldsymbol{x},\boldsymbol{x}')\big)$$
$$y = f(\boldsymbol{x}) + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0,\ \sigma_n^2)$$

where $m(\boldsymbol{x})$ and $k(\boldsymbol{x},\boldsymbol{x}')$ denote the mean and covariance functions, respectively, and $\varepsilon$ is additive Gaussian observation noise.

Accordingly, the posterior distribution of the test-set output vector $f_*$ is

$$f_* \mid \boldsymbol{X}, \boldsymbol{y}, \boldsymbol{X}_* \sim \mathcal{N}\big(\bar{f}_*,\ \operatorname{cov}(f_*)\big)$$

where $\bar{f}_*$ denotes the posterior mean of $f_*$ and $\operatorname{cov}(f_*)$ denotes its posterior covariance; a confidence interval constructed from this mean and variance serves as the uncertainty scenario.

The confidence interval with a confidence level of $1-\alpha$ is:

41b7c17bd162cf8f4fe6dbd3f3ed0519.jpg

where $f_*^{\mathrm{u}}$ and $f_*^{\mathrm{l}}$ denote the upper and lower bounds of the confidence interval, respectively, and $z_{\alpha/2}$ represents the quantile under the corresponding confidence level.
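As an illustration of the scenario-generation procedure described above, the following sketch fits a GPR model with an RBF covariance to historical source/load data and returns the expected single-day curve with its confidence region. It assumes scikit-learn's GaussianProcessRegressor (the library used in the case study); the function and variable names are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from scipy.stats import norm

def generate_scenario(X_train, y_train, X_day, alpha=0.05):
    """Fit GPR on historical data and return the expected single-day curve
    together with its (1 - alpha) confidence region."""
    # RBF covariance; kernel hyperparameters are tuned by maximum likelihood inside fit()
    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(X_train, y_train)

    mean, std = gpr.predict(X_day, return_std=True)
    z = norm.ppf(1.0 - alpha / 2.0)             # quantile for the confidence level
    lower = np.clip(mean - z * std, 0.0, None)  # truncate negative power values
    upper = mean + z * std
    return mean, lower, upper
```

In this form, the kernel hyperparameters are adjusted by maximizing the log-marginal likelihood inside fit(), which corresponds to the adaptive parameter adjustment described in the case study.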

1.2 Objective function

The objective function of the microgrid intraday scheduling model is as follows:

131a3d73a9a6e92669451f3707361771.jpg
0bc0d7a7f4dd966685c55014669290cf.jpg

1.3 Constraints

During the intraday scheduling, the constraints of microgrid operation are as follows:

1) Power balance equality constraint:

11a6b94a7f3c0fbf0436c092a5f2b451.jpg

2) Generation output inequality constraints (upper/lower bounds):

58f5fdf656f40e7c0affa538ca58f884.jpg

3) The tie line power constraint:

aa8a72629a60af4ac4190fbd23ccb18f.jpg

4) The energy storage charging and discharging constraint:

953324e29135c95a1bb5305da97d230f.jpg

where the binary variable denotes the energy storage charging/discharging status (1 represents charging, 0 represents discharging); η denotes the charging/discharging efficiency; the initial storage capacity is the capacity at the start of the scheduling horizon; and the maximum and minimum allowable remaining capacities bound the stored energy during the scheduling process.

5) EV charging and discharging constraint:

d8fb8aea93eee5889a6512f025ca90f4.jpg

2 Equivalent transformation of the microgrid scheduling model based on tailored TD3

2.1 Data preprocessing

To ensure effective exploration by the agent and to mitigate the negative impact of differences in parameter magnitudes on training convergence, all training data are normalized as follows:

$$X_{\mathrm{normalized}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

where $X_{\mathrm{normalized}}$ is the normalized result, $X$ represents the pre-normalization value, and $X_{\min}$ and $X_{\max}$ are the minimum and maximum values in the sample sequence, respectively.
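A minimal sketch of the min-max normalization above (the helper name is illustrative):

```python
import numpy as np

def min_max_normalize(x):
    """Normalize a sample sequence to [0, 1] using the equation above."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```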

2.2 Decision-making based on Markov process

In this study, the traditional microgrid model is transformed into a Markov decision process whose basic components are the tuple $(S, A, R, \gamma)$, where S represents the set of environmental states, A the set of agent actions, R the reward for the agent, and γ the discount factor that balances short-term and long-term rewards. The environment, states, and actions are defined as follows:

1) Environment

1ef6e0d5c71946ca81223e0c5777b44a.jpg

where $s_t$ denotes the observed operating state provided by the environment to the agent, which includes the power demand to be met at time t and the state of charge (SOC) of the energy storage at the previous time step.

2) Action

c567627d5739fec031c6bf346c16f414.jpg

3) Reward and penalty

948a060c4ce34cef134cfb8d0175a325.jpg

where σ1 denotes the scaling factor for the control costs; σ2 represents the scaling factor for the constraint-violation penalties; r and c represent the operational costs and constraint penalty values, respectively; the penalty terms account for energy storage SOC violations, MT output violations, and grid power purchase and sale violations at time t; and other constraints, such as the output upper and lower bounds, can be normalized using the tanh function.
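To make the Markov formulation above concrete, the sketch below wraps a simplified scheduling environment behind a reset/step interface. The state and action composition, cost terms, SOC limits, and coefficient values are illustrative assumptions for a toy example (the paper's environment uses a 7-dimensional state and 2 continuous actions), not the exact model above.

```python
import numpy as np

class MicrogridEnv:
    """Simplified 15-min-step intraday environment (illustrative only)."""
    def __init__(self, demand, price, soc_init=0.5, sigma1=1e-3, sigma2=1e-2):
        self.demand, self.price = np.asarray(demand), np.asarray(price)  # 96-step profiles
        self.soc_init, self.sigma1, self.sigma2 = soc_init, sigma1, sigma2

    def reset(self):
        self.t, self.soc = 0, self.soc_init
        return self._state()

    def _state(self):
        return np.array([self.demand[self.t], self.price[self.t], self.soc])

    def step(self, action):
        # action[0]: storage charge(+)/discharge(-), action[1]: MT output,
        # both normalized to [-1, 1] and rescaled to physical limits in practice
        p_ess, p_mt = float(action[0]), float(action[1])
        soc_next = self.soc + 0.25 * p_ess                              # 15-min energy update (per-unit)
        penalty = max(0.1 - soc_next, 0.0) + max(soc_next - 0.9, 0.0)   # SOC bound violation
        self.soc = float(np.clip(soc_next, 0.0, 1.0))
        p_grid = self.demand[self.t] - p_mt + p_ess                     # net power purchased from grid
        cost = self.price[self.t] * max(p_grid, 0.0)
        reward = -(self.sigma1 * cost + self.sigma2 * penalty)          # negative scaled cost and penalty
        self.t += 1
        done = self.t >= len(self.demand)
        next_state = self._state() if not done else np.zeros(3)
        return next_state, reward, done
```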

2.3 ST-PER-TD3 algorithm

The TD3 algorithm is an improvement on the deterministic policy gradient method DDPG [33]. Building on DDPG, it introduces two independent critic networks to estimate the Q-values of state–action pairs, effectively reducing the overestimation of action Q-values in the DDPG algorithm. The methodology of TD3 is as follows. First, to enhance the agent's action exploration capability and smooth the policy expectation during parameter updates, behavioral policy noise and target policy noise are separately added to the actor. Second, to mitigate overestimation, both the real and target networks of the critic employ twin networks. Finally, to improve the stability of the output policy, the parameters of the actor's real network are updated with a delay, only after the critic networks have undergone multiple updates. However, TD3 performs random data sampling during training, which results in low training efficiency and slow reward convergence. As a tree-based data structure, the ST's logical architecture aligns with TD3's data storage and access requirements, and its integration into the algorithm can enhance data processing efficiency.

This paper introduces the ST into the experience replay buffer and sets a priority indicator for experience data, aiming to achieve efficient PER and increase the utilization of high-update-value experience data, thereby improving the training efficiency of the agent.

In this study, the critic network employs the action-value function to compute the TD error, and the sampling priority of each experience is defined as:

$$\rho_l = \frac{|\delta_l|^{\upsilon}}{\sum_{k}|\delta_k|^{\upsilon}}$$

where $\rho_l$ and $\delta_l$ denote the sampling priority probability and the corresponding TD error of the lth experience datum, respectively, and υ is the weighting factor.

Setting υ = 0 corresponds to uniform sampling, and υ = 1 corresponds to greedy-strategy sampling. To reduce the discrepancy in sampling probability between data with larger and smaller δ, this study sets υ = 0.6. Additionally, to avoid missing experience data with a very small TD error, new experience data are initialized as follows:

$$\delta_{l0} = \delta_{\max}$$

where $\delta_{l0}$ denotes the TD error assigned to the lth experience datum when it is first added to the experience replay buffer B, and $\delta_{\max}$ represents the maximum TD error within B; this initialization increases the sampling probability of experience data with very small TD errors.
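The sketch below shows one common way to implement the sum-tree storage assumed by the proportional prioritization above: each leaf holds a transition's priority (for example, $|\delta|^{\upsilon}$, with new transitions seeded at $\delta_{\max}$), internal nodes hold the sum of their children, and sampling walks down the tree with a cumulative value. The class and method names are illustrative.

```python
import numpy as np

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes hold sums of their children
        self.data = [None] * capacity
        self.write = 0

    def _propagate(self, idx, change):
        parent = (idx - 1) // 2
        self.tree[parent] += change
        if parent != 0:
            self._propagate(parent, change)

    def update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        self._propagate(idx, change)

    def add(self, priority, transition):
        idx = self.write + self.capacity - 1      # leaf index of the new transition
        self.data[self.write] = transition
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity

    def get(self, s):
        """Walk down to the leaf whose cumulative priority covers the value s."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):       # descend until a leaf is reached
            left = 2 * idx + 1
            idx = left if s <= self.tree[left] else left + 1
            if idx != left:
                s -= self.tree[left]
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]
```

For hierarchical sampling, the total priority tree[0] can be split into N equal segments and one value s drawn uniformly from each segment, which yields the coverage of high-, medium-, and low-priority samples described in the introduction.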

The microgrid economic dispatch model based on the tailored TD3 method is shown in Fig. 1. The specific training process of the agent is as follows:


Fig. 1. Tailored TD3 algorithm framework.

1) Initialize the parameters θ1, θ2, and φ of the three real networks (critic 1, critic 2, and the actor). Initialize the parameters of the three target networks with the same values, namely $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, and $\phi' \leftarrow \phi$.

2) Set the capacity of the experience replay buffer B and the number of training samples N.

3) Acquire experience data tuples and add them to B.

a) Randomly select an initial state $s_t$ from historical data. Under exploration noise, the actor $\pi_\phi$ selects an action $a_t$ based on the state $s_t$:

$$a_t = \pi_\phi(s_t) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma)$$

b) Interact with the environment using the action $a_t$, thereby obtaining the reward value $r_t$ and the next state $s_{t+1}$, forming the data tuple $(s_t, a_t, r_t, s_{t+1})$.

c) Using the data's TD error as its priority metric, update the node values of the relevant nodes in the ST accordingly.

d) Check the number of experience data entries in buffer B. If the number has not reached the set capacity limit, use the current $s_{t+1}$ as the next $s_t$ and repeat the above steps; otherwise, stop adding data and assign the largest δ in buffer B to the corresponding nodes of each data entry.

4) Sample N data entries from buffer B, and add noise based on target policy smoothing regularization to each entry to obtain the target action $\tilde{a}_{t+1}$ corresponding to $s_{t+1}$:

$$\tilde{a}_{t+1} = \pi_{\phi'}(s_{t+1}) + \epsilon', \quad \epsilon' \sim \operatorname{clip}\big(\mathcal{N}(0, \tilde{\sigma}), -\psi, \psi\big)$$

where ψ denotes the noise truncation value.

5) Feed the obtained $(s_{t+1}, \tilde{a}_{t+1})$ and the observed reward $r_t$ into the two critic target networks to compute the target value $y_t$:

$$y_t = r_t + \gamma \min_{i=1,2} Q_{\theta'_i}\big(s_{t+1}, \tilde{a}_{t+1}\big)$$

6) Based on the gradient descent algorithm, minimize the error between the target value and the estimated Q-value to update the parameters θ of the two critic real networks:

$$\theta_i \leftarrow \arg\min_{\theta_i} \frac{1}{N}\sum_{t}\big(y_t - Q_{\theta_i}(s_t, a_t)\big)^2, \quad i = 1, 2$$

7) Using a learning rate of τ1, compute the weighted average of the parameters of the real and target networks to soft-update the parameters of the critic target networks:

$$\theta'_i \leftarrow \tau_1 \theta_i + (1 - \tau_1)\theta'_i, \quad i = 1, 2$$

8) Recalculate the δ of the data entry and update the node values of its corresponding leaf node and the relevant nodes in the ST.

9) After the critic network has been updated for d steps, the parameters φ of the actor real network are similarly updated using the gradient descent algorithm:

$$\nabla_\phi J(\phi) = \frac{1}{N}\sum_{t}\nabla_{a}Q_{\theta_1}(s_t, a)\Big|_{a=\pi_\phi(s_t)}\nabla_\phi \pi_\phi(s_t)$$

10) Use a learning rate of τ2 to soft-update the parameters of the actor target network.

$$\phi' \leftarrow \tau_2 \phi + (1 - \tau_2)\phi'$$

Loop through steps 4)–9) and record the reward values until the number of training rounds reaches the maximum.
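A condensed PyTorch sketch of one update covering steps 4)–10) is given below. The network classes, optimizer setup, and hyperparameter values are placeholders; the returned TD errors would be written back to the sum tree as new priorities (step 8).

```python
import torch
import torch.nn.functional as F

def td3_update(actor, actor_targ, critic1, critic2, critic1_targ, critic2_targ,
               opt_actor, opt_critics, batch, gamma=0.99, tau1=0.005, tau2=0.005,
               policy_noise=0.2, noise_clip=0.5, update_actor=False):
    s, a, r, s_next = batch  # tensors sampled from the sum-tree buffer

    with torch.no_grad():
        # Step 4: target policy smoothing - clipped Gaussian noise on the target action
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_targ(s_next) + noise).clamp(-1.0, 1.0)
        # Step 5: clipped double-Q target
        q_targ = torch.min(critic1_targ(s_next, a_next), critic2_targ(s_next, a_next))
        y = r + gamma * q_targ

    # Step 6: update both critics towards the common target
    q1, q2 = critic1(s, a), critic2(s, a)
    critic_loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)
    opt_critics.zero_grad()
    critic_loss.backward()
    opt_critics.step()

    # Step 7: soft-update the critic target networks
    for net, targ in [(critic1, critic1_targ), (critic2, critic2_targ)]:
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau1).add_(tau1 * p.data)

    td_error = (y - q1).abs().detach()  # step 8: new priorities for the sum tree

    if update_actor:  # step 9: delayed actor update
        actor_loss = -critic1(s, actor(s)).mean()
        opt_actor.zero_grad()
        actor_loss.backward()
        opt_actor.step()
        # Step 10: soft-update the actor target network
        for p, p_targ in zip(actor.parameters(), actor_targ.parameters()):
            p_targ.data.mul_(1 - tau2).add_(tau2 * p.data)

    return td_error
```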

3 Case studies

The simulation of new energy power generation, real-time electricity prices, and electricity demand uses monthly real-world data from an actual region in Zhejiang Province, China, with a sampling interval of 15 min. Since the selection of covariance functions has been extensively studied in the existing literature [34], a radial basis function (RBF) covariance function was selected following the expert experience method [35], and the model was implemented in Python 3.10. Eleven features were input into the model, including month, day-of-week features, load values, wind speed, and surface radiation. The GPR parameters were adaptively adjusted using the gp.fit() function of the scikit-learn library, and negative predicted values were truncated. The single-day regression curve obtained from training, along with its 95% confidence region, was used to generate the uncertainty scenarios for intraday scheduling.

The input dimension of the DRL neural network is 7, corresponding to the 7 state variables in the state space. The intermediate layers consist of 2 fully connected layers with 256 neurons each, using the rectified linear unit (ReLU) activation function. The output layer uses the tanh activation function to map values to the range [−1, 1] and has a dimension of 2, corresponding to the 2 continuous actions in the action space. During training, the learning rates of the actor and critic networks were set to [α] and [β], respectively, with a decay rate of 0.99. The training configuration includes a sample library capacity of 20,000, a total of 100,000 training episodes, and 96 optimization steps per episode (each step spanning 15 min). Every 480 episodes, 200 samples were drawn from the critic network's sample library to update the actor network. The energy storage battery has a maximum capacity of 1,000 kWh with an initial SOC of 0.5. The first 1,000 episodes were dedicated to storing environmental samples in the sample library; after the initial sample library reached 1,000 entries, the reinforcement learning algorithm began searching for the optimal action policy. The parameters of the TD3 algorithm are listed in Table 1, those of the generation units and energy storage are provided in Table 2, and the time-of-use electricity purchase and sale prices are shown in Table 3.
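For reference, the stated network architecture (7-dimensional state, 2 continuous actions, two 256-neuron ReLU hidden layers, tanh output on the actor) can be written in PyTorch as follows; the class names are illustrative.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=7, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # actions mapped to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim=7, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),                      # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```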

Table 1 Hyperparameters of the deep reinforcement learning algorithm.


Table 2 Parameters of the distributed generators and energy storage equipment.


Table 3 Time-of-use electricity price parameters.


Experiments were run on a machine equipped with Intel(R) Xeon(R) E5-2666 v3 CPU cores, 32 GB of memory, and an NVIDIA GeForce GTX 1080 GPU.

All test cases were executed on identical hardware within the same Python environment. Neural network models were trained using PyTorch 1.12.0, with consistent neural network parameters across cases. The computational results were compared with those of the following five baselines:

B1: DDPG algorithm.

B2: TD3 algorithm.

B3: rank-based PER-TD3 algorithm.

B4: pure-greedy-based PER-TD3 algorithm.

B5: traditional robust scheduling method.

The description of rank-based and pure-greedy-based PER-TD3 algorithms can be found in Appendix A.

Notably, 5 independent experiments were conducted for each sample, and 95% bootstrap percentile confidence intervals were used to describe the uncertainty in the reward estimates [36].
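A minimal sketch of the 95% bootstrap percentile confidence interval used to summarize the repeated runs is shown below; the number of bootstrap resamples is an assumption.

```python
import numpy as np

def bootstrap_ci(rewards, n_resamples=2000, level=0.95, seed=None):
    """Return the mean of the runs and its bootstrap percentile confidence interval."""
    rng = np.random.default_rng(seed)
    rewards = np.asarray(rewards, dtype=float)   # e.g. final rewards of the 5 runs
    means = [rng.choice(rewards, size=rewards.size, replace=True).mean()
             for _ in range(n_resamples)]
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return rewards.mean(), lo, hi
```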

During real-time operation, the 95% confidence interval of the GPR is used to simulate the uncertainty of the wind power output, photovoltaic power output, and demand (Fig. 2).


Fig. 2. GPR curve and its 95% confidence interval.

The convergence of the reward curves is shown in Fig. 3, which indicates that all baselines achieved their optimization objectives; the trained neural network models were then saved for real-time scheduling. During training, B1, relying on a single critic network to evaluate action outcomes, is prone to overestimating Q-values, leading to significant fluctuations in the reward values. B2 uses dual critic networks to estimate Q-values, which reduces reward fluctuations but results in slower convergence. B3 employs a rank-based prioritization sampling strategy, which discards the specific numerical values of the TD errors; as a result, certain important transitions with low TD errors and low ranks tend to be underestimated. For instance, during episodes 200–400, the algorithm occasionally failed to learn effective policies in certain training instances. B4 consistently replays the experience with the highest priority. While this approach accelerates convergence, it also leads to repeated sampling of high-error transitions; consequently, samples with low TD errors, despite their potential importance, are rarely selected, resulting in slower convergence, as reflected in the performance curve. As a conventional static robust optimization routine, B5 does not interact with the environment and thus produces no reward curve; instead, it yields a single worst-case solution by solving the min–max program offline. The proposed ST-PER-TD3 algorithm converges well from the early episodes and attains the best final cost.


Fig. 3. Reward curves of different algorithms.

The real-time scheduling results of the different algorithms are shown in Fig. 4, and Table 4 compares the dispatching results obtained by the different algorithms.

Table 4 Comparison of system operation average results by different methods.


Fig. 4. Power balance diagrams of the scheduling strategies for the various methods.

Fig. 4 presents the results of the different algorithms. The energy storage SOC values exhibit some upward and downward fluctuations, which is normal. B1, B2, and B3 make similar intraday real-time decisions and adapt their decisions well to different moments. During intraday dispatching with B4, energy storage charging and discharging are performed frequently, resulting in higher unit operating costs and a higher total cost. B5 provides a robust intraday optimal schedule: the worst-case optimal solution is obtained with the Gurobi solver, but its decision-making speed is slow. The proposed algorithm trades electricity more effectively while keeping the unit operation and maintenance cost low, resulting in a lower intraday dispatching cost.

Table 4 shows that the total cost of the proposed method, at 11,680 CNY, is the lowest, being 0.95%, 0.78%, 0.15%, 6.08%, and 0.87% lower than those of B1, B2, B3, B4, and B5, respectively. The results indicate that the proposed ST-PER-TD3 algorithm generates superior strategies. Additionally, by screening key information using the TD error, the ST-PER-TD3 algorithm effectively mitigates local convergence problems during training, and both the convergence speed and the stability of its reward curve surpass those of the other TD3 variants. It can therefore be concluded that the ST-PER-TD3 algorithm achieves a superior operating strategy.

Additionally, the solution time of the traditional robust optimization method is 0.93 s, whereas that of the trained ST-PER-TD3 algorithm is 0.055 s; the proposed ST-PER-TD3 algorithm is therefore approximately 17 times faster than the traditional method. Compared with the TD3 algorithm, the ST-PER-TD3 algorithm improves the convergence speed by approximately 2.42 times.

To investigate the impact of key parameters on the performance of the proposed ST-PER-TD3 algorithm, the learning rates were sampled uniformly on a logarithmic scale within the range $[10^{-5}, 10^{-2}]$. For each sampling point, five repeated experiments were conducted, and the smoothed reward values and their 95% bootstrap percentile confidence intervals were plotted. The reward curves for the tested actor and critic learning rates are shown in Fig. 5.


Fig. 5. Variation of the agent's average reward value under different parameter settings.

Meanwhile, to guide the agent to explore the environment extensively during the early training stage, avoid falling into local optima, and refine the policy during the middle-to-late training stages to enhance policy stability, a noise decay strategy was adopted to dynamically adjust the noise magnitude. Different initial noise standard deviations σ ∈ {0, 0.1, 0.2, 0.3} were selected for the ablation experiments to investigate the convergence performance of the agent under different noise levels; σ was linearly decayed to 0.1 over 500 steps.
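A small sketch of the assumed linear noise-decay schedule is given below: the initial standard deviation σ0 is reduced to 0.1 over the first 500 steps and held constant afterwards (the σ0 = 0 case keeps zero noise).

```python
def exploration_sigma(step, sigma0=0.3, sigma_min=0.1, decay_steps=500):
    """Linearly decay the exploration-noise standard deviation to sigma_min."""
    if sigma0 <= sigma_min:          # e.g. the no-noise ablation (sigma0 = 0)
        return sigma0
    frac = min(step / decay_steps, 1.0)
    return sigma0 - frac * (sigma0 - sigma_min)
```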

Fig. 6 shows the average training rewards of the agent and their 95% confidence intervals under different initial noise standard deviations. With no action noise (i.e., σ = 0), the agent exhibits poor environmental exploration capability and converges slowly. With σ = 0.1, the agent's convergence speed increases. With σ set to 0.2 or 0.3, the agent explores the environment sufficiently and the reward curve converges favorably; σ = 0.3 yields the most stable convergence and was therefore selected.


Fig. 6. Variation of the agent's average reward value under different initial noise values.

4 Conclusion

This study developed a DRL algorithm tailored to the intraday scheduling of microgrids under source–load uncertainty. The PER sampling process was accelerated through the ST data structure, a proportional prioritization sampling strategy was used to select more valuable experiences for agent training, and the optimal parameters of the ST-PER-TD3 algorithm for microgrid intraday scheduling tasks were determined. Case studies demonstrate that the proposed ST-PER-TD3 algorithm exhibits outstanding performance in both convergence speed and average intraday scheduling cost. In the future, we will study the robustness of the system under communication delays and errors between devices and attempt to extend the method to multi-microgrid systems.

CRediT authorship contribution statement

Shenpeng Xiang: Writing – original draft, Formal analysis, Data curation, Conceptualization. Mohan Lin: Writing – review & editing, Methodology, Formal analysis, Conceptualization. Zhe Chen: Supervision, Methodology, Funding acquisition. Pingliang Zeng: Writing – review & editing, Supervision, Resources. Xiangjin Wang: Supervision, Methodology, Data curation. Diyang Gong: Resources, Methodology.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Zhe Chen, Xiangjin Wang, and Diyang Gong are currently employed by the Electric Power Research Institute of State Grid Zhejiang Electric Power Co., Ltd. The other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Science and Technology Project of State Grid Corporation of China (No. 5400-202319829A-4-1-KJ).

Appendix A

The rank-based PER-TD3 algorithm defines the experience sampling probability based on the absolute value of the TD error, with the following formula:

$$\rho_i = \frac{p_i^{\upsilon}}{\sum_{k} p_k^{\upsilon}}, \qquad p_i = \frac{1}{\mathrm{rank}_i}$$

where υ represents the weighting factor and $p_i = 1/\mathrm{rank}_i$ denotes the priority assigned to experience i, with $\mathrm{rank}_i$ indicating the ranking of the ith experience among all experiences.

The pure-greedy-based PER-TD3 algorithm consistently selects the current highest-priority samples for replay.
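For comparison with the proportional scheme used by ST-PER-TD3, the two baseline prioritization rules above can be sketched as follows; the function names are illustrative.

```python
import numpy as np

def rank_based_probs(td_errors, upsilon=0.6):
    """Rank-based PER: p_i = 1 / rank_i, ranked by |TD error| in descending order."""
    ranks = np.argsort(np.argsort(-np.abs(td_errors))) + 1   # rank 1 = largest error
    p = (1.0 / ranks) ** upsilon
    return p / p.sum()

def pure_greedy_pick(td_errors):
    """Pure-greedy PER: always replay the transition with the current highest priority."""
    return int(np.argmax(np.abs(td_errors)))
```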

References

[1]Multi-level optimal energy management strategy for a grid tied microgrid considering uncertainty in weather conditions and load,Sci.Rep.,(n.d.).https://www.nature.com/articles/s41598-024-59655-7 (accessed June 17,2025).

[2]EV and RE Uncertainty Analysis in Micro-Grids,IEEE Conference Publication,IEEE Xplore,(n.d.).https://ieeexplore.ieee.org/document/10474737 (accessed June 17,2025).

[3]S.Ray,A.M.Ali,T.Eshchanov,E.Khudoynazarov,Optimal energy management of the smart microgrid considering uncertainty of renewable energy sources and demand response programs,(2025).

[4]J.Zhang,Y.Cai,X.Tang,Y.Tan,Y.Cao,J.Li,Information gap decision theory-spectrum clustering typical scenario generation method considering source load uncertainty,J.Shanghai Jiao Tong Univ.(2023),https://doi.org/10.16183/j.cnki.jsjtu.2023.580.

[5]L.Li,W.Tang,M.Bai,et al.,Multi-objective locating and sizing of distributed generators based on time-sequence characteristics,Automat.Electr.Power Syst.37 (3) (2013) 58–63,https://doi.org/10.7500/AEPS201204046.

[6]M.Ding,J.Xie,X.Liu,W.Shi,The generation method and application of wind resources/load typical scenario set for evaluation of wind power grid integration,Proc.CSEE 36 (15)(2016) 4064–4071,https://doi.org/10.13334/j.0258-8013.pcsee.152854.

[7]L.Li,W.Tang,M.Bai,L.Zhang,T.Lu,Multi-objective locating and sizing of distributed generators based on time-sequence characteristics 58–63+128,Dianli Xitong Zidonghua/Automat.Electr.Power Syst.37 (2013),https://doi.org/10.7500/AEPS201204046.

[8]J.Li,Review of wind power scenario generation methods for optimal operation of renewable energy systems,Appl.Energy(2020).

[9]B.Bai,M.Han,J.Lin,et al.,Scenario reduction method of renewable energy including wind power and photovoltaic,Power Syst.Protect.Control 49 (15) (2021) 141–149.

[10]A.B.Krishna,A.R.Abhyankar,Time-coupled day-ahead wind power scenario generation: a combined regular vine copula and variance reduction method,Energy 265 (2023),https://doi.org/10.1016/j.energy.2022.126173 126173.

[11]P.Wang,L.Fan,Z.Cheng,A joint state of health and remaining useful life estimation approach for lithium-ion batteries based on health factor parameter,Proc.CSEE 42 (4) (2022) 1523–1533,https://doi.org/10.13334/j.0258-8013.pcsee.202368.

[12]J.A.Lin,S.Ament,M.Balandat,D.Eriksson,J.M.Hernández-Lobato,E.Bakshy,Scalable Gaussian processes with latent kronecker structure,(n.d.).

[13]B.Zhang,Y.Xia,X.Peng,Robust optimal dispatch strategy of integrated energy system considering CHP-P2G-CCS,Global Energy Interconnect.7 (2024) 14–24,https://doi.org/10.1016/j.gloei.2024.01.002.

[14]F.Tian,Y.Jia,H.Ren,Y.Bai,T.Huang,“Source-load” lowcarbon economic dispatch of integrated energy system considering carbon capture system,Power Syst.Technol.44 (9) (2020) 3346–3354,https://doi.org/10.13335/j.1000-3673.pst.2020.0728.

[15]P.Li,H.Zhong,H.Ma,J.Li,Y.Liu,J.Wang,Multi-timescale optimal dispatch of source-load-storage coordination in active distribution network based on deep reinforcement learning,Trans.China Electrotech.Soc.40 (5) (2025) 1487–1502.

[16]Y.Li,C.Yu,M.Shahidehpour,T.Yang,Z.Zeng,T.Chai,Deep reinforcement learning for smart grid operations: algorithms,applications,and prospects,Proc.IEEE 111 (2023) 1055–1096,https://doi.org/10.1109/JPROC.2023.3303358.

[17]J.Ding,C.Liu,Y.Zheng,Y.Zhang,Z.Yu,R.Li,H.Chen,J.Piao,H.Wang,J.Liu,Y.Li,Artificial intelligence for complex network: potential,methodology and application,arXiv (2024),https://doi.org/10.48550/arXiv.2402.16887.

[18]D.Zhang,X.Han,C.Deng,Review on the research and practice of deep learning and reinforcement learning in smart grids,CSEE J.Power Energy Syst 4 (2018).

[19]L.-L.Xiong,S.Mao,Y.Tang,K.Meng,Z.-Y.Dong,F.Qian,Reinforcement learning based integrated energy system management: a survey,Zidonghua Xuebao/Acta Automat.Sin.47 (2021) 2321–2340,https://doi.org/10.16383/j.aas.c210166.

[20]H.Zheng,Y.Jiang,J.Dai,Optimization of voltage regulation and loss reduction of offshore wind power transmission based on multi-agent Q-learning in uncertain environment,Zhongguo Dianji Gongcheng Xuebao/Proc.Chinese Soc.Electr.Eng.44(2024) 7995–8008,https://doi.org/10.13334/j.0258-8013.pcsee.230562.

[21]W.Lin,X.Wang,Q.Sun,Z.Liu,J.He,T.Pu,Dynamic dispatch of an integrated energy system based on deep reinforcement learning in an uncertain environment,Dianli Xitong Baohu Yu Kongzhi/Power Syst.Protect.Control 50(2022)50–60,https://doi.org/10.19783/j.cnki.pspc.211685.

[22]W.Luo,J.Zhang,Y.He,T.Gu,X.Nie,L.Fan,X.Yuan,B.Li,Optimal scheduling of regional integrated energy system based on advantage learning soft actor-critic algorithm and transfer learning,Dianwang Jishu/Power Syst.Technol.47 (2023) 1601–1611,https://doi.org/10.13335/j.1000-3673.pst.2022.1241.

[23]M.Chen,Y.Sun,Z.Xie,The multi-time-scale management optimization method for park integrated energy system based on the bi-layer deep reinforcement learning,Diangong Jishu Xuebao/Trans.China Electrotech.Soc.38 (2023) 1864–1881,https://doi.org/10.19595/j.cnki.1000-6753.tces.211879.

[24]S.Jiang,H.Gao,X.Wang,J.Liu,K.Zuo,Deep reinforcement learning based multi-level dynamic reconfiguration for urban distribution network: a cloud-edge collaboration architecture,Global Energy Interconnect.6 (2023) 1–14,https://doi.org/10.1016/j.gloei.2023.02.001.

[25]Q.-Y.Fan,M.Cai,B.Xu,An improved prioritized DDPG based on fractional-ord er learning scheme,IEEE Trans.Neural Netw.Learn.Syst.36 (2025) 6873–6882,https://doi.org/10.1109/TNNLS.2024.3395508.

[26]Real-time optimal power flow using twin delayed deep deterministic policy gradient algorithm,IEEE J.Mag.,IEEE Xplore,(n.d.).https://ieeexplore.ieee.org/document/9272783(accessed June 17,2025).

[27]J.Li,T.Yu,X.Zhang,F.Li,D.Lin,H.Zhu,Efficient experience replay based deep deterministic policy gradient for AGC dispatch in integrated energy system,Appl.Energy 285 (2021),https://doi.org/10.1016/j.apenergy.2020.116386.

[28]O.Yazar,S.Coskun,L.Li,F.Zhang,C.Huang,Actor-critic TD3-based deep reinforcement learning for energy management strategy of HEV,in: 2023 5th International Congress on Human-Computer Interaction,Optimization and Robotic Applications(HORA),2023,pp.1–6,https://doi.org/10.1109/HORA58378.2023.10156727.

[29]X.Gu,T.Liu,S.Li,T.Wang,X.Yang,Active power correction control of power grid based on improved twin delayed deep deterministic policy gradient algorithm,Diangong Jishu Xuebao/Trans.China Electrotech.Soc.38 (2023) 2162–2177,https://doi.org/10.19595/j.cnki.1000-6753.tces.221073.

[30]B.Zhang,Y.Zou,X.Zhang,G.Du,W.Sun,W.Sun,Energy management strategy based on TD3-PER for hybrid electric tracked vehicle,Qiche Gongcheng/Automot.Eng.44 (2022)1400–1409,https://doi.org/10.19562/j.chinasae.qcgc.2022.09.011.

[31]G.Qiu,H.He,K.Liu,S.Luo,C.He,F.Shen,Low-carbon economic dispatch of integrated energy system based on improved TD3 algorithm,Electr.Power Sci.Eng.39 (10) (2023) 52–62.

[32]T.Schaul,J.Quan,I.Antonoglou,D.Silver,Prioritized experience replay,arXiv (2016),https://doi.org/10.48550/arXiv.1511.05952.

[33]S.Fujimoto,H.van Hoof,D.Meger,Addressing function approximation error in actor-critic methods,arXiv (2018),https://doi.org/10.48550/arXiv.1802.09477.

[34]C.Lyu,X.Liu,L.Mihaylova,Review of recent advances in Gaussian process regression methods,(n.d.).

[35]N.Huang,B.Qi,Z.Liu,G.Cai,E.Xing,Probabilistic short-term load forecasting using Gaussian process regression with area grey incidence decision making,Dianli Xitong Zidonghua/Automat.Electr.Power Syst.42 (2018) 64–71,https://doi.org/10.7500/AEPS20180118008.

[36]A.Patterson,S.Neumann,M.White,A.White,Empirical design in reinforcement learning,(n.d.).

Received 18 June 2025;revised 6 September 2025;accepted 11 October 2025

Peer review under the responsibility of Global Energy Interconnection Group Co.Ltd.

* Corresponding authors.

E-mail addresses: spxiang@hdu.edu.cn (S.Xiang),mohan_lin@hdu.edu.cn (M.Lin),zchen331@163.com (Z.Chen),plzeng@hotmail.com (P.Zeng),hfwangxj@126.com (X.Wang),diyang915@outlook.com (D.Gong).

https://doi.org/10.1016/j.gloei.2025.10.003

2096-5117/© 2025 Global Energy Interconnection Group Co.Ltd.Publishing services by Elsevier B.V.on behalf of KeAi Communications Co.Ltd.This is an open access article under the CC BY-NC-ND license(http://creativecommons.org/licenses/by-nc-nd/4.0/).


Shenpeng Xiang received the B.S. degree from the Department of Mechanical and Electrical Engineering, City University of Wuhan, Wuhan, China, in 2022. He is currently working toward the M.S. degree in Electrical Engineering at Hangzhou Dianzi University, Hangzhou, China. His research interests include the application of artificial intelligence methods in microgrid scheduling.
