Authors: Milad Hasanzadeh, Amin Kargarian
\textit{DPLib} is an open-source MATLAB-based benchmark library created to support research and development in distributed and decentralized power system analysis and optimization. Distributed and decentralized methods offer scalability, privacy preservation, and resilience to single points of failure, making them increasingly important for modern power systems. However, unlike centralized tools such as MATPOWER, no general-purpose, reproducible data library currently exists for distributed power system studies. DPLib fills this gap by providing a standard power system library featuring over 20 multi-region benchmark test cases of varying sizes, along with a graph-based partitioning toolkit that decomposes any MATPOWER test system into multiple electrically coherent regions. The partitioning toolkit, implemented as easy-to-use MATLAB code, generates standardized \texttt{.mat} and \texttt{.m} files, along with region visualizations for intuitive understanding. We also provide modular, easy-to-use distributed optimal power flow (OPF) solvers: an alternating direction method of multipliers (ADMM)-based DC-OPF solver implemented in YALMIP, and an ADMM-based AC-OPF solver leveraging IPOPT. These solvers validate the generated test systems for distributed optimization applications. Numerical results confirm the validity of the generated test cases, establishing DPLib as a foundation for reproducible distributed power system research.
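To make the ADMM workflow concrete, here is a minimal consensus-ADMM sketch in Python on a hypothetical two-region, one-tie-line quadratic dispatch problem. It is not DPLib code (DPLib's solvers are MATLAB/YALMIP/IPOPT based); it only illustrates the local-solve, consensus, and dual-update pattern such solvers follow.

```python
# Toy consensus-ADMM for a two-region "DC-OPF" with one tie-line.
# Hypothetical example: each region r chooses its generation g_r
# (quadratic cost a_r * g_r^2) and a local copy t_r of the tie-line
# flow; ADMM drives the two copies to agreement.
rho = 1.0                # ADMM penalty weight
a = [0.5, 1.0]           # quadratic cost coefficients per region
d = [1.0, 2.0]           # local demand per region
t = [0.0, 0.0]           # local tie-flow copies (region 1 exports t)
u = [0.0, 0.0]           # scaled dual variables
z = 0.0                  # consensus value of the tie flow

for k in range(100):
    # Local solves (closed form because the subproblems are quadratic):
    # region 1 balance: g1 = d1 + t1; region 2 balance: g2 = d2 - t2.
    t[0] = (rho * (z - u[0]) - 2 * a[0] * d[0]) / (2 * a[0] + rho)
    t[1] = (rho * (z - u[1]) + 2 * a[1] * d[1]) / (2 * a[1] + rho)
    z = 0.5 * (t[0] + u[0] + t[1] + u[1])   # consensus update
    u[0] += t[0] - z                        # dual updates
    u[1] += t[1] - z

g = [d[0] + z, d[1] - z]
print(f"tie flow = {z:.4f}, generation = {g}")  # converges to t = 1.0
```

With these costs and demands the iterates converge to the centralized optimum (tie flow 1.0), which is the agreement property the boundary-consensus formulation guarantees.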
Authors: Akash Mahajan, Shivam Chaturvedi, Srijita Das, Wencong Su, Van-Hai Bui
Selecting optimal power electronic converter parameters involves balancing efficiency and thermal constraints to ensure high performance without compromising safety. This paper introduces a probabilistic-learning-based stochastic surrogate modeling framework to address this challenge and significantly reduce the time required during the design phase. The approach begins with a neural network classifier that evaluates the feasibility of parameter configurations, effectively filtering out unsafe and/or impractical inputs. Subsequently, a probabilistic prediction model estimates the converter's efficiency and temperature while quantifying prediction uncertainty, providing both performance insights and reliability metrics. Finally, heuristic optimization is employed to solve a multi-objective problem that maximizes efficiency while adhering to thermal constraints. The optimization process incorporates penalty terms to discourage solutions that violate practical thresholds, ensuring actionable and realistic recommendations. An advanced heuristic optimization method is used to find the optimal solution and is compared with several well-known search algorithms, including Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Simulated Annealing (SA), Tabu Search (TS), and Stochastic Hill Climbing (SHC). The results demonstrate significant improvements in predictive accuracy and optimization outcomes, offering a robust solution for advancing power electronics design.
Authors: Mohini Bariya, Genevieve Flaspohler
Electric grids in low- and middle-income countries (LMICs) across the world face an acute challenge. To support global decarbonisation efforts and lift millions out of energy poverty, these grids must shoulder substantial load growth while integrating distributed renewable generation. However, decades of rapid and poorly funded infrastructure expansion have left national grids in many LMICs strained and weak, composed of aging, faulty, and undersized infrastructure. Both a cause and a symptom of this weakness is excessive technical loss within the grid infrastructure during energy delivery, particularly at the distribution level; network losses are regularly estimated at well over 20 percent, compared to a baseline of 5 percent in higher-income nations. Addressing technical loss through targeted interventions is essential for bolstering grids' physical and economic strength. Unfortunately, current approaches for estimating and localizing technical loss require expensive, extensive power flow sensing, which is essentially absent in LMIC distribution systems. We present a novel approach to technical loss estimation without power flows, which leverages more readily available voltage magnitude measurements at sparse locations in the grid. This estimator puts loss estimation and localization within reach for LMIC grids globally, and provides a critical tool for the effective design, implementation, and evaluation of loss-reduction interventions.
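As a rough illustration of why voltage magnitudes carry loss information, consider a single short line with known impedance. The numbers and the angle-free approximation below are hypothetical and are not the paper's estimator, which must cope with sparse measurements across an entire network.

```python
# Back-of-envelope sketch: on a short radial line with known impedance,
# the voltage magnitude drop bounds the current and hence the I^2 * R
# technical loss. Ignores phase angle; single-phase; values hypothetical.
V_send, V_recv = 240.0, 228.0   # measured voltage magnitudes [V]
R, X = 0.30, 0.15               # line resistance/reactance [ohm]

Z = (R**2 + X**2) ** 0.5
I_approx = (V_send - V_recv) / Z       # crude current estimate [A]
P_loss = I_approx**2 * R               # technical loss on the line [W]
P_delivered = V_recv * I_approx        # rough delivered power [W]
share = 100 * P_loss / (P_loss + P_delivered)
print(f"~{P_loss:.0f} W lost ({share:.1f}% of injected power)")
```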
Authors: Ruixiao Yang, Gulai Shen, Ahmed S. Alahmed, Chuchu Fan
We study the joint scheduling of behind-the-meter distributed energy resources (DERs), including flexible loads, renewable generation, and battery energy storage systems, under net energy metering frameworks with demand charges. The problem is formulated as a stochastic dynamic program aimed at maximizing expected operational surplus while accounting for renewable generation uncertainty. We analytically characterize the structure of the optimal control policy and show that it admits a threshold-based form. However, due to the strong temporal coupling of the storage and demand charge constraints, the number of conditional branches in the policy scales combinatorially with the scheduling horizon, as it requires a look-ahead over future states. To overcome the high computational complexity in the general formulation, an efficient approximation algorithm is proposed, which searches for the peak demand under a mildly relaxed problem. We show that the algorithm scales linearly with the scheduling horizon. Extensive simulations using two open-source datasets validate the proposed algorithm and compare its performance against different DER control strategies, including a reinforcement learning-based one. Under varying storage and tariff parameters, the results show that the proposed algorithm outperforms various benchmarks in achieving a relatively small solution gap compared to the theoretical upper bound.
Authors: Tong Su, Tong Wu, Junbo Zhao, Anna Scaglione, Le Xie
Given the availability of more comprehensive measurement data in modern power systems, reinforcement learning (RL) has gained significant interest in operation and control. Conventional RL relies on trial-and-error interactions with the environment and reward feedback, which often leads to exploring unsafe operating regions and executing unsafe actions, especially when deployed in real-world power systems. To address these challenges, safe RL has been proposed to optimize operational objectives while ensuring safety constraints are met, keeping actions and states within safe regions throughout both training and deployment. Rather than relying solely on manually designed penalty terms for unsafe actions, as is common in conventional RL, safe RL methods reviewed here primarily leverage advanced and proactive mechanisms. These include techniques such as Lagrangian relaxation, safety layers, and theoretical guarantees like Lyapunov functions to rigorously enforce safety boundaries. This paper provides a comprehensive review of safe RL methods and their applications across various power system operations and control domains, including security control, real-time operation, operational planning, and emerging areas. It summarizes existing safe RL techniques, evaluates their performance, analyzes suitable deployment scenarios, and examines algorithm benchmarks and application environments. The paper also highlights real-world implementation cases and identifies critical challenges such as scalability in large-scale systems and robustness under uncertainty, providing potential solutions and outlining future directions to advance the reliable integration and deployment of safe RL in modern power systems.
Authors: Michał Forystek, Andrew D. Syrmakesis, Alkistis Kontou, Panos Kotsampopoulos, Nikos D. Hatziargyriou, Charalambos Konstantinou
The modern power grid increasingly depends on advanced information and communication technology (ICT) systems to enhance performance and reliability through real-time monitoring, intelligent control, and bidirectional communication. However, ICT integration also exposes the grid to cyber-threats. Load altering attacks (LAAs), which use botnets of high-wattage devices to manipulate load profiles, are a notable threat to grid stability. While previous research has examined LAAs, their specific impact on load frequency control (LFC), which is critical for maintaining nominal frequency during load fluctuations, remains largely unexplored. Even minor frequency deviations can jeopardize grid operations. This study bridges this gap by analyzing LAA effects on LFC through simulations of static and dynamic scenarios using Python and RTDS. The results highlight LAA impacts on frequency stability and present an eigenvalue-based stability assessment for dynamic LAAs (DLAAs), identifying key parameters influencing grid resilience.
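As a sketch of what an eigenvalue-based DLAA assessment involves, the snippet below builds a textbook single-area LFC state matrix (hypothetical parameters, not the paper's Python/RTDS models) with a frequency-coupled attack gain K_L and checks when the eigenvalues cross into the right half-plane.

```python
# Minimal eigenvalue-based stability check for a single-area LFC model
# under a dynamic load altering attack (DLAA) that injects a load
# component -K_L * delta_f. Parameters are hypothetical textbook values.
import numpy as np

H, D = 5.0, 1.0              # inertia constant, load damping (p.u.)
Tg, Tt, R = 0.1, 0.4, 0.05   # governor/turbine time constants, droop

def lfc_eigs(K_L):
    # States: [delta_f, delta_Pm, delta_Pv].
    A = np.array([
        [-(D - K_L) / (2 * H), 1 / (2 * H), 0.0],   # swing equation
        [0.0, -1 / Tt, 1 / Tt],                     # turbine
        [-1 / (R * Tg), 0.0, -1 / Tg],              # governor with droop
    ])
    return np.linalg.eigvals(A)

for K_L in (0.0, 5.0, 25.0):
    stable = np.all(np.real(lfc_eigs(K_L)) < 0)
    print(f"K_L = {K_L:5.1f} -> {'stable' if stable else 'UNSTABLE'}")
```

With these values the system remains stable for small attack gains but loses stability around K_L = 25, which is the kind of parametric boundary an eigenvalue assessment exposes.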
Authors: Oktay Karakuş, Padraig Corcoran
Electric vehicle (EV) charging infrastructure is increasingly critical to sustainable transport systems, yet its resilience under environmental and infrastructural stress remains underexplored. In this paper, we introduce RSERI-EV, a spatially explicit and multi-modal risk assessment framework that combines remote sensing data, open infrastructure datasets, and spatial graph analytics to evaluate the vulnerability of EV charging stations. RSERI-EV integrates diverse data layers, including flood risk maps, land surface temperature (LST) extremes, vegetation indices (NDVI), land use/land cover (LULC), proximity to electrical substations, and road accessibility, to generate a composite Resilience Score. We apply this framework to the EV charger dataset of Wales to demonstrate its feasibility. A spatial $k$-nearest neighbours ($k$NN) graph is constructed over the charging network to enable neighbourhood-based comparisons and graph-aware diagnostics. Our prototype highlights the value of multi-source data fusion and interpretable spatial reasoning in supporting climate-resilient, infrastructure-aware EV deployment.
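A sketch of the $k$NN-graph step is shown below, using hypothetical station coordinates and scores rather than the Welsh dataset: the graph supports exactly the kind of neighbourhood-based comparison the framework performs.

```python
# Build a spatial k-nearest-neighbour graph over charging stations and
# flag stations whose resilience score falls far below their
# neighbourhood average. Coordinates and scores are synthetic.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(50, 2))   # station easting/northing [km]
score = rng.uniform(0, 1, size=50)           # composite resilience score

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
_, idx = nn.kneighbors(coords)               # idx[:, 0] is the station itself

neigh_mean = score[idx[:, 1:]].mean(axis=1)  # mean score of k neighbours
outliers = np.where(score < neigh_mean - 0.3)[0]
print(f"{len(outliers)} stations score well below their neighbourhood")
```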
Authors: Soufiane El Yaagoubi, Keith Moffat, Eduardo Prieto Araujo, Florian Dörfler
Future electrical grids will require new ways to identify faults, both because inverters cannot supply the large fault currents that existing fault detection methods rely on and because distributed resources may feed faults from the edge of the grid. This paper proposes the use of real-time system identification for online power-system fault detection. Specifically, we implement Recursive ARX (rARX) system identification on a grid-connected inverter. Experiments demonstrate that the proposed rARX method is able to both detect large faults quickly and distinguish between high-impedance faults and large load increases. These results indicate that rARX grid-edge fault detection is a promising research direction for improving the reliability and safety of modern electric grids.
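The core of rARX detection is a recursive least-squares update whose residuals jump when the underlying dynamics change. The sketch below (a hypothetical first-order system and threshold, not the authors' inverter implementation) shows that pattern.

```python
# Recursive ARX via recursive least squares with a forgetting factor:
# track y[t] = a*y[t-1] + b*u[t-1] online; a sudden jump in the
# one-step residual can flag a fault-like change in the dynamics.
import numpy as np

lam = 0.98                 # forgetting factor
theta = np.zeros(2)        # [a, b] estimates
P = np.eye(2) * 1e3        # parameter covariance (large = uninformed)

def rarx_step(y_prev, u_prev, y_now):
    global theta, P
    phi = np.array([y_prev, u_prev])        # regressor
    err = y_now - phi @ theta               # one-step prediction residual
    K = P @ phi / (lam + phi @ P @ phi)     # RLS gain
    theta = theta + K * err
    P = (P - np.outer(K, phi) @ P) / lam
    return err

# Simulate a system whose gain b drops (a crude "fault") at t = 100.
rng = np.random.default_rng(1)
y, u = 0.0, rng.standard_normal(200)
for t in range(1, 200):
    b = 1.0 if t < 100 else 0.2
    y_new = 0.9 * y + b * u[t - 1] + 0.01 * rng.standard_normal()
    r = rarx_step(y, u[t - 1], y_new)
    if t > 20 and abs(r) > 0.5:             # skip the initial transient
        print(f"t={t}: residual {r:+.2f} exceeds threshold")
    y = y_new
```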
Authors: Muhammad Zeeshan Mumtaz, Mohammadali Mohammadi, Hien Quoc Ngo, Michail Matthaiou
This paper investigates the simultaneous wireless information and power transfer (SWIPT) capability of a modular extremely large multiple-input multiple-output (XL-MIMO) system, in the context of power consumption (PC) efficiency. The network users are divided into two functional categories: information decoding (ID) users and energy harvesting (EH) users. Non-stationary near-field channels are considered whilst the users are located in spatially distinct visibility regions (VRs). We formulate a two-tier joint optimization problem to minimize the PC, taking into account the power allocation (PA) for ID and EH users, along with the activation of constituent XL-MIMO subarrays. This complicated mixed-integer problem is transformed into more tractable formulations and efficient algorithms are proposed for solving them. The numerical results demonstrate that the overall PC of the XL-MIMO system for the proposed method is reduced by more than 60% in comparison to the benchmark scheme of equal PA with full subarray activation (SA) and 30% against the case of optimized PA with full SA, while satisfying the quality-of-service (QoS) constraints on both the downlink rate of the ID users and harvested energy at the EH users.
Authors: Muhammad Zeeshan Mumtaz, Mohammadali Mohammadi, Hien Quoc Ngo, Michail Matthaiou
This paper explores the maximization of the harvested power efficiency (HPE) in a modular extremely large multiple-input multiple-output (XL-MIMO) system, which supports energy harvesting (EH) for near-field users. These users are located in spatially distinct visibility regions (VRs) with non-stationary channel characteristics. We propose to determine which sub-arrays are switched on or off, as well as the power control coefficients at the sub-arrays, to maximize the HPE. The design is handled via a multi-tier joint optimization framework based on fractional programming. The numerical results showcase that the HPE performance of the proposed algorithm is nearly optimal, comparable to that of exhaustive search. In fact, it achieves up to a 120% gain over the benchmark scheme, which uses the entire XL-MIMO array with equal power allocation (PA) across sub-arrays, while significantly reducing the computational time.
Authors: Dong Liu, Sander Timmerman, Yu Xiang, Peter Palensky, Pedro P. Vergara
This paper introduces a data-driven topology identification and correction approach for low-voltage distribution networks (LVDNs), combined with a time-based smart meter (SM) data selection strategy, aiming to correct outdated records and identify missing ones. The proposed approach relies solely on voltage magnitude measurements, alleviating privacy concerns and measurement burdens. It enables distribution system operators to identify switch states through supervised learning algorithms, and to determine user-feeder connections and customer phase labels with a modified hierarchical clustering algorithm. To address the similarity among SM data caused by distributed photovoltaic (PV) systems, the time-based SM data selection strategy is combined with the proposed correlation analysis. The feasibility and robustness of the approach are validated using modified real-world LVDNs and multiple incomplete SM datasets collected from customers in the Netherlands. The results demonstrate that the time-based SM data selection strategy effectively mitigates the impact of distributed PV systems on phase identification, and that the corrected topology not only improves network observability but also supports network operators in load balancing and PV consumption.
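To illustrate the clustering step, the sketch below groups synthetic customers by the correlation of their voltage magnitude profiles using plain average-linkage clustering; the paper's modified hierarchical clustering and its SM data selection strategy are more elaborate.

```python
# Cluster customers by correlation of voltage magnitude time series:
# customers on the same phase/feeder see similar voltage behaviour.
# Data are synthetic (3 phases, 10 customers each).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
T, n_per_phase = 500, 10
phase_signals = rng.standard_normal((3, T))          # shared per-phase behaviour
profiles = np.vstack([
    phase_signals[p] + 0.3 * rng.standard_normal((n_per_phase, T))
    for p in range(3)
])                                                   # 30 customers x 500 samples

corr = np.corrcoef(profiles)
dist = 1.0 - corr                                    # correlation distance
iu = np.triu_indices_from(dist, k=1)
Z = linkage(dist[iu], method="average")              # condensed distance input
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # customers on the same phase should share a label
```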
Authors: Yun Xu, Yunxiao Bai, Yunyong Zhang, Peng Wang, Xuelin Wang, Jiqun Guo, Kaijun Xie, Rusheng Zhao
The growing integration of renewable energy sources necessitates adequate reserve capacity to maintain power balance. However, in market clearing, power companies with flexible resources may submit strategic bids to maximize profits, potentially compromising system reserves. This paper examines the effects of such strategic behavior by modeling the market as a bi-level problem. The upper level represents a strategic company aiming to maximize profit, while the lower level simulates the system operator clearing the market based on submitted offers. To enable duality-based solution methods, we approximate unit commitments with a continuous reserve capacity calculation. Case studies indicate that, in an imperfectly competitive market, more units are incentivized to operate, enhancing system reserves. However, some units go online mainly for profit, ultimately raising electricity costs for consumers. These findings highlight the importance of market design in managing the trade-off between reserve adequacy and economic efficiency in the presence of strategic bidding behavior.
Authors: Andrew Mole, Max Weissenbacher, Georgios Rigas, Sylvain Laizet
Traditional wind farm control operates each turbine independently to maximize individual power output. However, coordinated wake steering across the entire farm can substantially increase the combined wind farm energy production. Although dynamic closed-loop control has proven effective in flow control applications, wind farm optimization has relied primarily on static, low-fidelity simulators that ignore critical turbulent flow dynamics. In this work, we present the first reinforcement learning (RL) controller integrated directly with high-fidelity large-eddy simulation (LES), enabling real-time response to atmospheric turbulence through collaborative, dynamic control strategies. Our RL controller achieves a 4.30% increase in wind farm power output compared to baseline operation, nearly doubling the 2.19% gain from static optimal yaw control obtained through Bayesian optimization. These results establish dynamic flow-responsive control as a transformative approach to wind farm optimization, with direct implications for accelerating renewable energy deployment to net-zero targets.
Authors: Verena Häberle, Xiuqiang He, Linbin Huang, Florian Dörfler, Steven Low
We propose a decentralized framework for guaranteeing the small-signal stability of future power systems with grid-forming converters. Our approach leverages dynamic loop-shifting techniques to compensate for the lack of passivity in the network dynamics and establishes decentralized parametric stability certificates, depending on the local device-level controls and incorporating the effects of the network dynamics. By following practical tuning rules, we are able to ensure plug-and-play operation without centralized coordination. Unlike prior works, our approach accommodates coupled frequency and voltage dynamics, incorporates network dynamics, and does not rely on specific network configurations or operating points, offering a general and scalable solution for the integration of power-electronics-based devices into future power systems. We validate our theoretical stability results through numerical case studies in a high-fidelity simulation model.
Authors: Ignasi Ventura Nadal, Jochen Stiasny, Spyros Chatzivasileiadis
Time-domain simulations are crucial for ensuring power system stability and avoiding critical scenarios that could lead to blackouts. In next-generation power systems, additional degrees of uncertainty, non-linearity, and states significantly increase the computational cost and complexity of these simulations. Physics-Informed Neural Networks (PINNs) have been shown to accelerate single-component simulations by several orders of magnitude. However, their application to current time-domain simulation solvers has been particularly challenging since the system's dynamics depend on multiple components. Using a new training formulation, this paper introduces the first natural step to integrate PINNs into multi-component time-domain simulations. We propose PINNs as an alternative to classical numerical methods for individual components. Once trained, these neural networks approximate component dynamics more accurately over longer time steps. Formulated as an implicit method consistent with the transient simulation workflow, PINNs speed up simulation time by significantly increasing the usable time steps. For clarity of exposition, we demonstrate the training, integration, and simulation framework using the IEEE 9-bus system for several combinations of PINNs and numerical solution methods, although the method applies equally well to power systems of any size.
Authors: Maria Patrou, Thomas Wang, Wael Elwasif, Markus Eisenbach, Ross Miller, William Godoy, Oscar Hernandez
With high-performance computing systems now running at exascale, optimizing power-scaling management and resource utilization has become more critical than ever. This paper explores runtime power-capping optimizations that leverage integrated CPU-GPU power management on architectures like the NVIDIA GH200 superchip. We evaluate energy-performance metrics that account for simultaneous CPU and GPU power-capping effects by using two complementary approaches: speedup-energy-delay and a Euclidean distance-based multi-objective optimization method. By targeting a mostly compute-bound exascale science application, the Locally Self-Consistent Multiple Scattering (LSMS), we explore challenging scenarios to identify potential opportunities for energy savings in exascale applications, and we recognize that even modest reductions in energy consumption can have significant overall impacts. Our results highlight how GPU task-specific dynamic power-cap adjustments combined with integrated CPU-GPU power steering can improve the energy utilization of certain GPU tasks, thereby laying the groundwork for future adaptive optimization strategies.
Authors: Eunhyuk Park, Seok-Hwan Park, Osvaldo Simeone, Marco Di Renzo, Shlomo Shamai
As the dense deployment of access points (APs) in cell-free massive multiple-input multiple-output (CF-mMIMO) systems presents significant challenges, per-AP coverage can be expanded using large-scale antenna arrays (LAAs). However, this approach incurs high implementation costs and substantial fronthaul demands due to the need for dedicated RF chains for all antennas. To address these challenges, we propose a hybrid beamforming framework that integrates wave-domain beamforming via stacked intelligent metasurfaces (SIM) with conventional digital processing. By dynamically manipulating electromagnetic waves, SIM-equipped APs enhance beamforming gains while significantly reducing RF chain requirements. We formulate a joint optimization problem for digital and wave-domain beamforming along with fronthaul compression to maximize the weighted sum-rate for both uplink and downlink transmission under finite-capacity fronthaul constraints. Given the high dimensionality and non-convexity of the problem, we develop alternating optimization-based algorithms that iteratively optimize digital and wave-domain variables. Numerical results demonstrate that the proposed hybrid schemes outperform conventional hybrid schemes that rely on randomly set wave-domain beamformers or restrict digital beamforming to simple power control. Moreover, the proposed scheme employing sufficiently deep SIMs achieves near fully-digital performance with fewer RF chains in most simulated cases, except in the downlink at low signal-to-noise ratios.
Authors: Zeinab Salehi, Yijun Chen, Ian R. Petersen, Guodong Shi, Duncan S. Callaway, Elizabeth L. Ratnam
The recent widespread adoption of rooftop solar backed by battery storage is enabling energy customers to both produce and consume electricity (i.e., to be prosumers of electricity). To facilitate prosumer participation in the electric grid, new market mechanisms are required. In this paper, we design peer-to-peer energy markets where prosumers trade their excess energy with peers to gain profit while satisfying the overall balance in electricity supply and demand. We first consider a market structure for the case where voltage and/or thermal constraints are binding. When such grid constraints are binding, market clearing prices can vary across locations. However, heterogeneous prices may be considered by regulators to lack fairness. To ensure uniform pricing, we design two peer-to-peer energy markets with dynamic operating envelopes (DOEs). DOEs enable us to decompose global voltage and thermal constraints across the power grid into local constraints for each prosumer, resulting in uniform prices across the grid. By means of numerical simulations on an IEEE 13-node feeder, we benchmark the proposed market-based approaches in the presence of binding voltage constraints.
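As a toy illustration of uniform pricing under DOEs (hypothetical costs and limits, and a far simpler market than the paper's), the LP below clears a two-prosumer trade and reads the single clearing price from the dual of the supply-demand balance constraint; the DOEs enter only as local export bounds.

```python
# Toy uniform-price clearing: minimize supply cost subject to balance
# and per-prosumer export envelopes (standing in for DOEs). The uniform
# price is the shadow price of the balance constraint.
import numpy as np
from scipy.optimize import linprog

c = np.array([0.10, 0.25])        # marginal cost of each prosumer [$/kWh]
doe = np.array([4.0, 6.0])        # per-prosumer export envelope [kWh]
demand = 7.0                      # consumer demand [kWh]

# min c^T x  s.t.  x1 + x2 = demand,  0 <= x <= doe
res = linprog(c, A_eq=[[1.0, 1.0]], b_eq=[demand],
              bounds=list(zip([0, 0], doe)), method="highs")
price = abs(res.eqlin.marginals[0])   # abs() guards the dual-sign convention
print(f"dispatch = {res.x}, uniform price = {price:.2f} $/kWh")
```

Here the cheap prosumer hits its envelope, so the expensive prosumer sets a single marginal price (0.25 $/kWh) paid uniformly, which is the fairness property the DOE-based designs target.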
Authors: Sarra Bouchkati, Ramil Sabirov, Steffen Kortmann, Andreas Ulbig
This paper introduces an efficient Residual Reinforcement Learning (RRL) framework for voltage control in active distribution grids. Voltage control remains a critical challenge in distribution grids, where conventional Reinforcement Learning (RL) methods often suffer from slow training convergence and inefficient exploration. To overcome these challenges, the proposed RRL approach learns a residual policy on top of a modified Sequential Droop Control (SDC) mechanism, ensuring faster convergence. Additionally, the framework introduces a Local Shared Linear (LSL) architecture for the Q-network and a Transformer-Encoder actor network, which collectively enhance overall performance. Unlike several existing approaches, the proposed method relies solely on inverters' measurements without requiring full state information of the power grid, rendering it more practical for real-world deployment. Simulation results validate the effectiveness of the RRL framework in achieving rapid convergence, minimizing active power curtailment, and ensuring reliable voltage regulation.
Authors: Jovan Krajacic, Keith Moffat, Gustavo Valverde
As power systems evolve with increasing production from Inverter-Based Resources (IBRs), their underlying dynamics are undergoing significant changes that can jeopardize system operation, leading to poorly damped oscillations or small-signal rotor angle instability. In this work, we investigate whether Power System Stabilizer (PSS) setting adjustments can effectively restore system stability and provide adequate damping in systems with increased IBR penetration, using the benchmark Kundur Two-Area System as a case study. Specifically, we evaluate the model-based Residues and P-Vref PSS tuning methods to examine their effectiveness under evolving grid conditions. Our findings indicate that the effectiveness of these tuning methods is not guaranteed, particularly when coordination is limited. Consequently, our case study motivates local and adaptive online PSS tuning methods.
Authors: Youcefa Brahim Elkhalil, Nima Tashakor, Davood Keshavarzi, Ehsan Asadi, Stefan Goetz
During grid faults, grid-forming converters are typically suggested to switch from a voltage-source to a current-source mode to limit the current and protect the electronics. During this transition, the converter may transiently lose synchronization because of the current saturation. Therefore, this paper proposes an alternative current saturation algorithm to improve transient synchronization stability during mode switching. The algorithm is designed for grid-forming converters to meet low-voltage ride-through (LVRT) requirements and grid-fault standards in addition to transient synchronization stability. Moreover, it limits the converter output current during grid faults with a new control parameter. The presented method introduces converter output virtual fluxes to calculate the current references in the d- and q-axes for the current saturation algorithm, enhancing LVRT performance and grid stability. The method exploits the correlation between the converter's virtual fluxes and currents to modify the current saturation levels through real-time converter virtual flux estimation. The adaptive saturation levels ensure precise control and high dynamics during grid faults and facilitate optimal power injection or absorption to support the grid. The proposed current-saturation algorithm is analytically evaluated, and hardware-in-the-loop (HIL) experiments validate its effectiveness.
Authors: Yulin Liu, Zhaojun Ruan, Libao Shi
Owing to advanced communication networks and intelligent electronic devices, cyber-physical distribution systems (CPDSs) possess the capability to perform flexible economic dispatch and achieve rapid self-healing from extreme events. Meanwhile, the deep integration of cyber and physical systems makes CPDSs vulnerable to coordinated cyber-physical attacks. In this paper, a resilience assessment framework for CPDSs under coordinated cyber-physical attacks is proposed to investigate the impact of such attacks on load loss and service restoration. First, a three-stage defender-attacker-defender dynamic game model considering fake base station (FBS) and physical attacks is established, aiming to find the optimal defense resource deployment strategy that enhances the resilience of the CPDS. The physical attack is launched to cause faults on the power lines, while the FBS attack interrupts the wireless cellular network service to hinder the self-healing process of the CPDS. The lognormal shadowing model and search theory are applied to quantitatively describe the process of the coordinated cyber-physical attacks. Further, the constructed three-stage dynamic game model is equivalently recast as a tri-level max-min-max optimization model, which is solved using column-and-constraint generation combined with an enumeration method. Finally, the effectiveness of the proposed resilience assessment framework and solution strategy is demonstrated by simulation analysis on the modified IEEE 33-node CPDS and a real-world 47-node CPDS in China.
Authors: Young-ho Cho, Min-Seung Ko, Hao Zhu
A sustainable electricity infrastructure requires the explicit integration of carbon emissions into power system modeling and optimization paradigms. However, existing open-source datasets for power system R&D lack generator-level carbon emission profiling, limiting the ability to benchmark and compare various carbon-aware grid operational strategies. To address this gap, this work introduces PGLib-CO2, an open-source extension to the widely adopted PGLib-OPF test case library. PGLib-CO2 enriches standard network cases with CO2 and CO2-equivalent emission intensity factors by expanding the fuel-type categorization used by PGLib-OPF, attaining realistic generator-level carbon profiling. It is also packaged for both Python's pandapower and Julia's PowerModels.jl, for a seamless, user-friendly integration of emission modeling into grid computation and optimization tasks. The dataset produced by PGLib-CO2 can support grid-based carbon accounting, emission metric evaluation, and integration into AC optimal power flow (OPF) and optimal load shifting (OLS) formulations. We demonstrate PGLib-CO2's utility through case studies that quantify cost-emission trade-offs and optimize a carbon-aware objective function. By standardizing carbon-enhanced test cases, PGLib-CO2 provides an open-source, reproducible foundation for benchmarking carbon-aware computation, facilitating future research in sustainable power system operation.
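A sketch of the intended usage pattern on the pandapower side is shown below; the column name and emission factors are hypothetical stand-ins, so consult the PGLib-CO2 release for the actual field names and values.

```python
# Attach per-generator CO2 intensities to a pandapower case and price
# emissions alongside dispatch. Illustrative only: "co2_t_per_mwh" and
# the factors are made-up placeholders, not PGLib-CO2's actual schema.
import numpy as np
import pandapower as pp
import pandapower.networks as pn

net = pn.case14()                                   # standard IEEE 14-bus case
factors = np.linspace(0.95, 0.0, len(net.gen))      # hypothetical tCO2/MWh
net.gen["co2_t_per_mwh"] = factors

pp.runpp(net)                                       # AC power flow
emissions = net.res_gen.p_mw * net.gen["co2_t_per_mwh"]
print(f"generator output: {net.res_gen.p_mw.sum():.1f} MW")
print(f"implied emissions rate: {emissions.sum():.2f} tCO2/h")
# (Slack/ext_grid output is ignored here for brevity.)
```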
Authors: Agrim Gupta, Adel Heidari, Jiaming Jin, Dinesh Bharadia
Connectivity on-the-go was one of the most impressive technological achievements of the 2010s. However, multiple studies show that this has come at the expense of an increased carbon footprint that rivals that of the entire aviation sector. The two major contributors to this footprint are (a) smartphone batteries, which drive the embodied footprint, and (b) base-stations, which occupy an ever-increasing energy footprint to provide last-mile wireless connectivity to smartphones. The root cause of both turns out to be the same: communicating over the lossy last-mile wireless medium. In this paper, titled DensQuer, we show how base-station densification, i.e., replacing a single larger base-station with multiple smaller ones, reduces the effect of the last-mile wireless channel and thereby tackles both of these sources of increased carbon footprint. Backed by an open-source ray-tracing computation framework (Sionna), we show how a strategic densification strategy can reduce the number of required smaller base-stations to practically achievable numbers, leading to about 3x power savings in the base-station network. DensQuer also reduces the required deployment height of base-stations to as low as 15 m, which makes the smaller cells easily deployable on trees or street poles instead of requiring a dedicated tower. Further, by utilizing the hardware power rails newly introduced in Google Pixel 7a and later phones, we show that this strategically densified network reduces mobile transmit power by 10-15 dB, leading to about 3x reduction in total cellular power consumption and about 50% longer smartphone battery life when communicating over the cellular network.
Authors: Jasmin Y. Lim, Dimitrios Pylorof, Humberto E. Garcia, Karthik Duraisamy
Generation IV (Gen-IV) nuclear power plants are envisioned to replace the current reactor fleet, bringing improvements in performance, safety, reliability, and sustainability. However, large cost investments currently inhibit the deployment of these advanced reactor concepts. Digital twins bridge real-world systems with digital tools to reduce costs, enhance decision-making, and boost operational efficiency. In this work, a digital twin framework is designed to operate the Gen-IV Fluoride-salt-cooled High-temperature Reactor, utilizing data-enhanced methods to optimize operational and maintenance policies while adhering to system constraints. The closed-loop framework integrates surrogate modeling, reinforcement learning, and Bayesian inference to streamline end-to-end communication for online regulation and self-adjustment. Reinforcement learning is used to consider component health and degradation to drive the target power generations, with constraints enforced through a Reference Governor control algorithm that ensures compliance with pump flow rate and temperature limits. These input driving modules benefit from detailed online simulations that are assimilated to measurement data with Bayesian filtering. The digital twin is demonstrated in three case studies: a one-year long-term operational period showcasing maintenance planning capabilities, short-term accuracy refinement with high-frequency measurements, and system shock capturing that demonstrates real-time recalibration capabilities when boundary conditions change. These demonstrations validate robustness for health-aware and constraint-informed nuclear plant operation, with general applicability to other advanced reactor concepts and complex engineering systems.
Authors: Jun Wook Heo, Raja Jurdak, Sara Khalifa
High penetration of photovoltaic (PV) generation and battery energy storage systems (BESS) in individual households increases the need for methods that determine the optimal PV generation capacity and BESS capacity. Self-consumption and self-sufficiency are essential for optimising the operation of PV-BESS systems in households, aiming to minimise power import from and export to the main grid. However, self-consumption and self-sufficiency are not independent; they share a linear relationship. This paper demonstrates this relationship and proposes an optimal operating strategy that considers power generation and consumption profiles to maximise self-consumption and self-sufficiency in households equipped with a PV-BESS. We classify self-consumption and self-sufficiency patterns into four categories based on the ratio of self-sufficiency to self-consumption for each household and determine the optimal PV generation and BESS capacities using both a mathematical calculation and this ratio. These optimal operation values for each category are then simulated using Model Predictive Control (MPC) and Reinforcement Learning (RL)-based battery charging and discharging scheduling models. The results show that the ratio between self-consumption and self-sufficiency is a useful metric for determining the optimal capacity of PV-BESS systems to maximise the local utilisation of PV-generated power.
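The two metrics and their linear relationship are easy to state in code. The sketch below uses synthetic, battery-free profiles; the ratio SS/SC reduces to total PV generation over total load, which is the linear link the paper exploits.

```python
# Worked example: self-consumption (SC) = locally used PV / PV output,
# self-sufficiency (SS) = locally used PV / load. Profiles are synthetic.
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(24)
pv = np.clip(4.0 * np.sin((t - 6) * np.pi / 12), 0, None)  # kW, daylight bell
load = 1.5 + 0.5 * rng.random(24)                          # kW, flat-ish load

self_used = np.minimum(pv, load).sum()   # PV consumed on site (no battery)
sc = self_used / pv.sum()                # self-consumption
ss = self_used / load.sum()              # self-sufficiency
print(f"SC = {sc:.2f}, SS = {ss:.2f}, SS/SC = {ss / sc:.2f}")
# SS/SC equals pv.sum()/load.sum(): the linear relationship noted above.
```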
Authors: Guglielmo D'Amico, Filippo Petroni
The Rate of Occurrence of Failures (ROCOF) is a widely utilized indicator for assessing a system's performance over time, yet it does not fully disclose the instantaneous behavior of a system. This paper introduces new measures to complement the ROCOF, providing a more comprehensive understanding of system reliability, particularly for Markov systems. We define the Rate of Occurrence of Repairs (ROCOR), which quantifies the system's instantaneous tendency to transition from failure to working states, and the Rate of Inoccurrence (ROI), which measures the propensity to remain within the current subset of states (either working or failure) without transitioning out. Explicit expressions for the computation of these rates are derived for Markov systems. Furthermore, a Total Mobility Rate (TMR) is proposed, integrating these individual rates to capture the overall dynamism of the system. The utility of these new indicators is demonstrated through a significant real-world application to wind farm management. The results from the wind farm study show that ROCOR, ROI, and TMR, when used in conjunction with ROCOF, reveal nuanced operational dynamics and reliability characteristics that are not discernible from static measures like Weibull parameters or ROCOF alone. These indicators can distinguish between sites with similar long-term wind profiles by identifying different "reliability logics," such as persistence-driven versus transition-driven behaviors. This enriched, time-dependent perspective provides valuable information for maintenance scheduling, operational strategies, and risk assessment, ultimately enhancing the ability to manage complex systems effectively.
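For a Markov system these rates have simple probability-flux expressions. The sketch below computes ROCOF and its repair analogue ROCOR numerically for a hypothetical three-state machine; it follows the standard Markov ROCOF formula rather than reproducing the paper's derivations.

```python
# ROCOF(t): probability flux from working to failure states; ROCOR(t):
# flux in the opposite direction. Hypothetical 3-state machine with two
# working states (0: full, 1: degraded) and one failed state (2).
import numpy as np
from scipy.linalg import expm

Q = np.array([             # generator matrix [per hour]; rows sum to 0
    [-0.11, 0.10, 0.01],
    [0.20, -0.25, 0.05],
    [0.50, 0.00, -0.50],
])
W, F = [0, 1], [2]         # working / failure state sets
p0 = np.array([1.0, 0.0, 0.0])

for t in (1.0, 10.0, 100.0):
    p = p0 @ expm(Q * t)   # state distribution at time t
    rocof = sum(p[i] * Q[i, j] for i in W for j in F)
    rocor = sum(p[i] * Q[i, j] for i in F for j in W)
    print(f"t={t:6.1f} h: ROCOF={rocof:.4f}/h, ROCOR={rocor:.4f}/h")
```

A total mobility rate in the spirit of the paper's TMR could then aggregate these fluxes into one measure of how "dynamic" the system is at each instant.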
Authors: Ali Peivandizadeh
The explosive growth of artificial intelligence has created gigawatt-scale data centers that fundamentally challenge power system operation, exhibiting power fluctuations exceeding 500 MW within seconds and millisecond-scale variations of 50-75% of thermal design power. This paper presents a comprehensive theoretical framework that reconceptualizes Virtual Power Plants (VPPs) to accommodate these extreme dynamics through a four-layer hierarchical control architecture operating across timescales from 100 microseconds to 24 hours. We develop control mechanisms and stability criteria specifically tailored to converter-dominated systems with pulsing megawatt-scale loads. We prove that traditional VPP architectures, designed for aggregating distributed resources with response times of seconds to minutes, cannot maintain stability when confronted with AI data center dynamics exhibiting slew rates exceeding 1,000 MW/s at gigawatt scale. Our framework introduces: (1) a sub-millisecond control layer that interfaces with data center power electronics to actively dampen power oscillations; (2) new stability criteria incorporating protection system dynamics, demonstrating that critical clearing times reduce from 150 ms to 83 ms for gigawatt-scale pulsing loads; and (3) quantified flexibility characterization showing that workload deferability enables 30% peak reduction while maintaining AI service availability above 99.95%. This work establishes the mathematical foundations necessary for the stable integration of AI infrastructure that will constitute 50-70% of data center electricity consumption by 2030.
Authors: Ali Chouman (UGA, CSTB), Peter Riederer (CSTB), Frédéric Wurtz (UGA)
Climate change poses a serious threat to the Earth's ecosystems, fueled primarily by escalating greenhouse gas emissions. Among the main contributors, the building sector stands out due to its significant energy demand. Addressing this challenge requires innovative techniques for controlling energy systems in buildings. This paper formulates a methodology for evaluating the performance of such controllers. The evaluation process involves the establishment of a comprehensive test protocol and a diverse set of scenarios under which the controllers are assessed. Key performance indicators quantify their effectiveness based on the test results. A practical case study illustrates the methodology, focusing on the integration of Model Predictive Controllers (MPCs) with the Dimosim thermal simulation platform. The digital twin of the Greener building in Grenoble is used as the emulation model. The paper demonstrates the ability of the proposed methodology to test and rank MPCs in different test scenarios, providing valuable feedback on their performance capabilities, and highlights the importance of the developed approach for systematically evaluating and ranking MPCs for optimized building energy management.
Authors: Fangwei Cheng, Qian Luo, Jesse Jenkins
As the share of variable renewable energy in power systems grows, enhancing the operational flexibility of combined cycle gas turbines with carbon capture and storage (CCGT-CCS) becomes increasingly valuable. This study integrates techno-economic analysis with capacity expansion modeling to quantify the value of improved CCGT-CCS flexibility, such as lower start-up costs, reduced minimum generation, faster ramping, and shorter up/down times, at both plant and system levels. Using the Texas power system as a case study, we find that increased flexibility raises CCGT-CCS generation profits and installed capacity. Under various policy scenarios, CCGT-CCS benefits most from a CO2 tax (or equivalent emissions cap), more so than from clean energy standards or capture subsidies like the federal 45Q tax credit. However, electricity system cost savings remain modest, reducing total costs by only 0.3-0.5%. Thus, flexibility improvements should be pursued only if they entail limited increases in capital and maintenance costs.
Authors: Shenglu Wang, Kairui Feng, Mengqi Xue, Yue Song
The chance constrained optimal power flow (CC-OPF) essentially finds the low-cost generation dispatch scheme ensuring operational constraints are met with a specified probability, termed the security level. While the security level is a crucial input parameter, how it shapes the CC-OPF feasibility boundary has not been revealed. Changing the security level from a parameter to a decision variable, this letter proposes the inverse CC-OPF that seeks the highest feasible security level supported by the system. To efficiently solve this problem, we design a Newton-Raphson-like iteration algorithm leveraging the duality-based sensitivity analysis of an associated surrogate problem. Numerical experiments validate the proposed approach, revealing complex feasibility boundaries for security levels that underscore the importance of coordinating security levels across multiple chance constraints.
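To make the inverse problem concrete, the toy below searches for the highest feasible security level of a single Gaussian chance constraint; it uses plain bisection as a simpler stand-in for the letter's duality-based Newton-Raphson-like iteration.

```python
# Inverse-CC-OPF toy: find the largest security level eps for which a
# Gaussian chance constraint P(g + w <= cap) >= eps stays feasible,
# with g forced to cover demand. Parameters are hypothetical.
from scipy.stats import norm

cap, demand, sigma = 100.0, 80.0, 10.0   # line cap, load, noise std

def feasible(eps):
    # The chance constraint tightens the limit to cap - sigma*Phi^{-1}(eps);
    # the problem is feasible iff demand still fits under that limit.
    return demand + sigma * norm.ppf(eps) <= cap

lo, hi = 0.5, 1.0 - 1e-9
for _ in range(50):                       # bisection on the security level
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
print(f"highest feasible security level ~= {lo:.4f}")
# Closed form for this toy: Phi((cap - demand)/sigma) = Phi(2) ~= 0.9772.
```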
Authors: Yihong Zhou, Hanbin Yang, Thomas Morstyn
This paper proposes a Faster Inner Convex Approximation (FICA) method for solving power system dispatch problems with Wasserstein distributionally robust joint chance constraints (WJCC) and incorporating the modelling of the automatic generation control factors. The problem studied belongs to the computationally challenging class of WJCC with left-hand-side uncertainty (LHS-WJCC). By exploiting the special one-dimensional structure (even if only partially present) of the problem, the proposed FICA incorporates a set of strong valid inequalities to accelerate the solution process. We prove that FICA achieves the same optimality as the well-known conditional value-at-risk (CVaR) inner convex approximation method. Our numerical experiments demonstrate that the proposed FICA can yield 40x computational speedup compared to CVaR, and can even reach up to 500x speedup when the optimisation horizon exceeds 16 time steps. This speedup is achieved when only 50% of constraints in a WJCC have the one-dimensional structure. The approximation quality is numerically verified to be the same as CVaR, and the quality gap is below 1% when compared to the computationally demanding exact reformulation of the LHS-WJCC in most cases. We also discuss the applications of FICA in optimisation problems from other domains that (partially) exhibit the one-dimensional structure.
Authors: Basma Jumaa Saleh, Zaid Omar, Vikrant Bhateja, Lila Iznita Izhar
During the COVID-19 pandemic, medical imaging techniques like computed tomography (CT) scans demonstrated effectiveness in combating the rapid spread of the virus. It is therefore crucial to research computerized models for detecting COVID-19 from CT imaging. A novel processing method has been developed, utilizing radiomic features, to assist in CT-based diagnosis of COVID-19. Given the lower specificity of traditional features in distinguishing between different causes of pulmonary disease, the objective of this study is to develop a CT-based radiomics framework for differentiating COVID-19 from other lung diseases. The model is designed to focus on outlining COVID-19 lesions, as traditional features often lack specificity in this respect. The model categorizes images into three classes: COVID-19, non-COVID-19, or normal. It employs enhancement auto-segmentation principles using intensity dark channel prior (IDCP) and deep neural networks (ALS-IDCP-DNN) within a defined range of analysis thresholds. A publicly available dataset comprising COVID-19, normal, and non-COVID-19 classes was utilized to validate the proposed model's effectiveness. The best-performing classification model, a 50-layer Residual Neural Network (ResNet-50), attained an average accuracy, precision, recall, and F1-score of 98.8%, 99%, 98%, and 98%, respectively. These results demonstrate the capability of our model to accurately classify COVID-19 images, which could aid radiologists in diagnosing suspected COVID-19 patients. Furthermore, our model's performance surpasses that of more than 10 current state-of-the-art studies conducted on the same dataset.
Authors: Junzhe Shi, Ulf Jakob Flø Aarsnes, Shengyu Tao, Ruiting Wang, Dagfinn Nærheim, Scott Moura
Fuel cell (FC)/battery hybrid systems have attracted substantial attention for achieving zero-emission buses, trucks, ships, and planes. An online energy management system (EMS) is essential for these hybrid systems, as it controls energy flow and ensures optimal system performance; key aspects include fuel efficiency and mitigating FC and battery degradation. This paper proposes a health-aware EMS for FC and battery hybrid systems with multiple FC stacks. The proposed EMS employs mixed integer quadratic programming (MIQP) to control each FC stack in the hybrid system independently, i.e., MIQP-based individual stack control (ISC), yielding significant reductions in fuel cost and in FC and battery degradation. The proposed method is compared with classical dynamic programming (DP), achieving 2243 times faster computation than the DP method while maintaining near-optimal performance. The case study results show that ISC achieves a 64.68% total cost reduction compared to CSC in the examined scenario, with substantial reductions across key metrics including battery degradation (4%), hydrogen fuel consumption (22%), fuel cell idling loss (99%), and fuel cell load-change loss (41%).
Authors: Shimin Wang, Martin Guay, Richard D. Braatz
This article addresses the nonadaptive and robust output regulation problem for general nonlinear output feedback systems with error output. The global robust output regulation problem for a class of general output feedback nonlinear systems with an uncertain exosystem and high relative degree can be tackled by constructing a linear generic internal model, provided that a continuous nonlinear mapping exists. The proposed nonadaptive framework converts the nonlinear robust output regulation problem into a robust nonadaptive stabilization problem for an augmented system endowed with input-to-state stable dynamics. This approach removes the need to construct a specific Lyapunov function with a positive semi-definite derivative and avoids the common assumption of linear parameterization of the nonlinear system. The nonadaptive approach is extended by incorporating a nonparametric learning framework that ensures the feasibility of the nonlinear mapping, which can be tackled using a data-driven method. Moreover, the nonparametric learning framework allows the controlled system to learn the dynamics of the steady-state input behaviour from the signal generated by the internal model, with the output error as feedback. As a result, the nonadaptive/nonparametric approach can be advantageous in guaranteeing the convergence of the estimation and tracking errors even when the underlying controlled system dynamics are complex or poorly understood. The effectiveness of the theoretical results is illustrated on a benchmark example, a controlled Duffing system, and two practical examples: a continuously stirred tank reactor and a continuous bioreactor.
Authors: Zeno Woywood, Jasper I. Wiltfang, Julius Luy, Tobias Enders, Maximilian Schiffer
We study a sequential decision-making problem for a profit-maximizing operator of an autonomous mobility-on-demand system. Optimizing a central operator's vehicle-to-request dispatching policy requires efficient and effective fleet control strategies. To this end, we employ a multi-agent Soft Actor-Critic algorithm combined with weighted bipartite matching. We propose a novel vehicle-based algorithm architecture and adapt the critic's loss function to appropriately consider coordinated actions. Furthermore, we extend our algorithm to incorporate rebalancing capabilities. Through numerical experiments, we show that our approach outperforms state-of-the-art benchmarks by up to 12.9% for dispatching and up to 38.9% with integrated rebalancing.
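The matching step can be illustrated directly with SciPy's assignment solver; the costs below are plain distances on hypothetical positions, whereas the paper derives the edge weights from the learned critic.

```python
# Weighted bipartite matching for vehicle-to-request dispatch via the
# Hungarian algorithm. Positions are random stand-ins; in the paper the
# edge weights come from the multi-agent actor-critic, not raw distance.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(4)
vehicles = rng.uniform(0, 10, size=(6, 2))   # (x, y) of 6 idle vehicles
requests = rng.uniform(0, 10, size=(4, 2))   # pickup points of 4 requests

cost = np.linalg.norm(vehicles[:, None, :] - requests[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)     # min-cost vehicle->request match
for v, r in zip(rows, cols):
    print(f"vehicle {v} -> request {r} (distance {cost[v, r]:.2f})")
```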
Authors: Hwihun Jeong, Se Young Chun, Jongho Lee
Deep learning-based Magnetic Resonance (MR) reconstruction methods have focused on generating high-quality images but often overlook the impact on downstream tasks (e.g., segmentation) that utilize the reconstructed images. Cascading separately trained reconstruction network and downstream task network has been shown to introduce performance degradation due to error propagation and the domain gaps between training datasets. To mitigate this issue, downstream task-oriented reconstruction optimization has been proposed for a single downstream task. In this work, we extend the optimization to handle multiple downstream tasks that are introduced sequentially via continual learning. The proposed method integrates techniques from replay-based continual learning and image-guided loss to overcome catastrophic forgetting. Comparative experiments demonstrated that our method outperformed a reconstruction network without finetuning, a reconstruction network with naïve finetuning, and conventional continual learning methods. The source code is available at: this https URL.
Authors: Fabian Jaensch, Giuseppe Caire, Begüm Demir
Several studies have explored deep learning algorithms to predict large-scale signal fading, or path loss, in urban communication networks. The goal is to replace costly measurement campaigns, inaccurate statistical models, or computationally expensive ray-tracing simulations with machine learning models that deliver quick and accurate predictions. We focus on predicting path loss radio maps using convolutional neural networks, leveraging aerial images alone or in combination with supplementary height information. Notably, our approach does not rely on explicit classification of environmental objects, which is often unavailable for most locations worldwide. While the prediction of radio maps using complete 3D environmental data is well-studied, the use of only aerial images remains under-explored. We address this gap by showing that state-of-the-art models developed for existing radio map datasets can be effectively adapted to this task. Additionally, we introduce a new model dubbed UNetDCN that achieves on par or better performance compared to the state-of-the-art with reduced complexity. The trained models are differentiable, and therefore they can be incorporated in various network optimization algorithms. While an extensive discussion is beyond this paper's scope, we demonstrate this through an example optimizing the directivity of base stations in cellular networks via backpropagation to enhance coverage.
Authors: Le Xia, Yao Sun, Haijian Sun, Rose Qingyang Hu, Dusit Niyato, Muhammad Ali Imran
Semantic communication (SemCom) has been recently deemed a promising next-generation wireless technique to enable efficient spectrum savings and information exchanges, thus naturally introducing a novel and practical network paradigm where cellular and device-to-device (D2D) SemCom approaches coexist. Nevertheless, the involved wireless resource management becomes complicated and challenging due to the unique semantic performance measurements and energy-consuming semantic coding mechanism. To this end, this paper jointly investigates power control and spectrum reuse problems for energy-efficient D2D SemCom cellular networks. Concretely, we first model the user preference-aware semantic triplet transmission and leverage a novel metric of semantic value to identify the semantic information importance conveyed in SemCom. Then, we define the additional power consumption from semantic encoding in conjunction with basic power amplifier dissipation to derive the overall system energy efficiency (semantics/Joule). Next, we formulate an energy efficiency maximization problem for joint power and spectrum allocation subject to several SemCom-related and practical constraints. Afterward, we propose an optimal resource management solution by employing the fractional-to-subtractive problem transformation and decomposition while developing a three-stage method with theoretical analysis of its optimality guarantee and computational complexity. Numerical results demonstrate the adequate performance superiority of our proposed solution compared with different benchmarks.
Authors: Jie Li, Jing Li, Zhanyu Ju, Fengkui Gong, Lu Lv
We propose a dim and small target detection algorithm for drone broadcast frames based on time-frequency analysis of the communication protocol. Specifically, by analyzing modulation parameters and frame structures, prior knowledge of the transmission frequency, signal bandwidth, Zadoff-Chu (ZC) sequences, and frame length of drone broadcast frames is established. The RF signals are processed through the designed filter banks, and the frequency domain parameters of bounding boxes generated by the detector are corrected with the transmission frequency and signal bandwidth. Given the remarkable correlation characteristics of ZC sequences, the frequency domain parameters of bounding boxes with low confidence scores are corrected based on ZC sequences and frame length, which improves the detection accuracy of dim targets in low signal-to-noise ratio situations. Besides, a segmented energy refinement method is applied to mitigate the deviation caused by high-energy interference signals, which further corrects the time domain detection parameters for dim targets. As the sampling duration increases, the detection speed improves while the detection accuracy of broadcast frames, treated as small targets, decreases. The trade-off between detection accuracy and speed versus sampling duration is established, which helps meet different drone regulation requirements. Simulation results demonstrate that the proposed algorithm improves the evaluation metrics by 2.27\% compared to existing algorithms. The proposed algorithm also demonstrates strong robustness under varying flight distances, diverse types of environmental noise, and different flight visual environments. Besides, the broadcast frame decoding results indicate that 97.30\% accuracy of RID has been achieved.
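The ZC correlation property that the correction step relies on is easy to reproduce. The sketch below uses the standard Zadoff-Chu definition with a hypothetical root and length, and shows the sharp peak where a received stream embeds the expected sequence.

```python
# Correlating a noisy capture against a known Zadoff-Chu sequence:
# the embedded ZC segment produces a sharp, localizable peak.
import numpy as np

def zadoff_chu(root, length):
    # Standard ZC definition for odd sequence length.
    n = np.arange(length)
    return np.exp(-1j * np.pi * root * n * (n + 1) / length)

N, root = 139, 25                 # hypothetical parameters
zc = zadoff_chu(root, N)
rng = np.random.default_rng(5)
rx = np.concatenate([rng.standard_normal(300) * 0.5,   # noise
                     zc,                                # embedded frame preamble
                     rng.standard_normal(300) * 0.5])

corr = np.abs(np.correlate(rx, zc, mode="valid")) / N  # numpy conjugates zc
peak = corr.argmax()
print(f"peak at sample {peak} (expected 300), peak/mean = {corr[peak] / corr.mean():.1f}")
```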
Authors: Jinming Zhang, Xuanru Zhou, Jiachen Lian, Shuhe Li, William Li, Zoe Ezzes, Rian Bogley, Lisa Wauters, Zachary Miller, Jet Vonk, Brittany Morin, Maria Gorno-Tempini, Gopala Anumanchipalli
Speech dysfluency detection is crucial for clinical diagnosis and language assessment, but existing methods are limited by the scarcity of high-quality annotated data. Although recent advances in TTS models have enabled synthetic dysfluency generation, existing synthetic datasets suffer from unnatural prosody and limited contextual diversity. To address these limitations, we propose LLM-Dys -- the most comprehensive dysfluent speech corpus with LLM-enhanced dysfluency simulation. This dataset captures 11 dysfluency categories spanning both word and phoneme levels. Building upon this resource, we improve an end-to-end dysfluency detection framework. Experimental validation demonstrates state-of-the-art performance. All data, models, and code are open-sourced at this https URL.
Authors: Omar Mashaal, Hatem Abou-Zeid
Foundational models have shown remarkable potential in natural language processing and computer vision, yet remain in their infancy in wireless communications. While a few efforts have explored image-based modalities such as channel state information (CSI) and frequency spectrograms, foundational models that operate directly on raw IQ data remain largely unexplored. This paper presents IQFM, the first I/Q signal foundational model for wireless communications. IQFM supports diverse tasks, including modulation classification, angle-of-arrival (AoA) estimation, beam prediction, and RF fingerprinting, without heavy preprocessing or handcrafted features. We also introduce a task-aware augmentation strategy that categorizes transformations into core augmentations, such as cyclic time shifting, and task-specific augmentations. This strategy forms the basis for structured, task-dependent representation learning within a contrastive self-supervised learning (SSL) framework. Using this strategy, the lightweight encoder, pre-trained via SSL on over-the-air multi-antenna IQ data, achieves up to 99.67% and 65.45% accuracy on modulation and AoA classification, respectively, using only one labeled sample per class, outperforming supervised baselines by up to 7x and 145x. The model also generalizes to out-of-distribution tasks; when adapted to new tasks using only 500 samples per class and minimal parameter updates via LoRA, the same frozen encoder achieves 94.15% on beam prediction (vs. 89.53% supervised), 50.00% on RML2016a modulation classification (vs. 49.30%), and 96.05% on RF fingerprinting (vs. 96.64%). These results demonstrate the potential of raw IQ-based foundational models as efficient, reusable encoders for multi-task learning in AI-native 6G systems.
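The core augmentation named above (cyclic time shifting) is straightforward to sketch for contrastive pre-training; the array shapes below are hypothetical, and the actual IQFM pipeline pairs such views with an SSL objective such as NT-Xent.

```python
# Cyclic time-shift augmentation on raw IQ for contrastive SSL: two
# shifted views of the same capture form a positive pair; other
# captures in the batch act as negatives. Shapes are hypothetical.
import numpy as np

def cyclic_shift(iq, rng):
    # iq: complex array of shape (n_antennas, n_samples)
    return np.roll(iq, shift=rng.integers(iq.shape[-1]), axis=-1)

rng = np.random.default_rng(6)
batch = rng.standard_normal((32, 4, 1024)) + 1j * rng.standard_normal((32, 4, 1024))
view_a = np.stack([cyclic_shift(x, rng) for x in batch])  # positive view 1
view_b = np.stack([cyclic_shift(x, rng) for x in batch])  # positive view 2
# view_a[i] and view_b[i] share a label-free "same signal" relation that
# a contrastive loss can exploit to pre-train the encoder.
print(view_a.shape, view_b.shape)   # (32, 4, 1024) each
```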
Authors: Chi Liu, Zhezhuang Xu, Jiawei Zhou, Yazhou Yuan, Kai Ma, Meng Yuan
Green buildings (GBs) with renewable energy and building energy management systems (BEMS) enable efficient energy use and support sustainable development. Electric vehicles (EVs), as flexible storage resources, enhance system flexibility when integrated with stationary energy storage systems (ESS) for real-time scheduling. However, differing degradation and operational characteristics of ESS and EVs complicate scheduling strategies. This paper proposes a model-free deep reinforcement learning (DRL) method for joint real-time scheduling based on a combined battery system (CBS) integrating ESS and EVs. We develop accurate degradation models and cost estimates, prioritize EV travel demands, and enable collaborative ESS-EV operation under varying conditions. A prediction model optimizes energy interaction between CBS and BEMS. To address heterogeneous states, action coupling, and learning efficiency, the DRL algorithm incorporates double networks, a dueling mechanism, and prioritized experience replay. Experiments show a 37.94 percent to 40.01 percent reduction in operating costs compared to a mixed-integer linear programming (MILP) approach.
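For readers unfamiliar with the dueling mechanism incorporated into the DRL algorithm, the sketch below (PyTorch, with hypothetical state and action sizes) shows the standard dueling decomposition Q(s,a) = V(s) + A(s,a) - mean(A). It illustrates the generic architecture, not the paper's CBS-specific network.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Generic dueling Q-network: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantages A(s, a)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        # Subtracting the mean advantage makes the decomposition identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)

q = DuelingQNet(state_dim=16, n_actions=8)   # hypothetical sizes
print(q(torch.randn(4, 16)).shape)           # -> torch.Size([4, 8])
```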
Authors: Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li
Generating music with coherent structure and harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: this https URL. The code and model weights have been released on this https URL.
Authors: Yu Pan, Yuguang Yang, Yanni Hu, Jianhao Ye, Xiang Zhang, Hongbin Zhou, Lei Ma, Jianjun Zhao
Multilingual speech-to-speech translation (S2ST) aims to directly convert spoken utterances from multiple source languages into fluent and intelligible speech in a target language. Despite recent progress, several critical challenges persist: 1) achieving high-quality S2ST remains a significant obstacle; 2) most existing S2ST methods rely heavily on large-scale parallel speech corpora, which are difficult and resource-intensive to obtain. To tackle these challenges, we introduce S2ST-Omni, a novel, efficient, and scalable framework tailored for multilingual speech-to-speech translation. Specifically, we decompose S2ST into speech-to-text translation (S2TT) and text-to-speech synthesis (TTS). To enable high-quality S2TT while mitigating reliance on large-scale parallel speech corpora, we leverage powerful pretrained models: Whisper for robust audio understanding and Qwen 3.0 for advanced text comprehension. A lightweight speech adapter is introduced to bridge the modality gap between speech and text representations, facilitating effective utilization of pretrained multimodal knowledge. To ensure both translation accuracy and real-time responsiveness, we adopt a streaming speech generation model in the TTS stage, which generates the target speech in an autoregressive manner. Extensive experiments conducted on the CVSS benchmark demonstrate that S2ST-Omni consistently surpasses several state-of-the-art S2ST baselines in translation quality, highlighting its effectiveness and superiority.
Authors: Chenggang Cui, Jiaming Liu, Peifeng Hui, Pengfeng Lin, Chuanlin Zhang
Designing controllers for complex industrial electronic systems is challenging due to nonlinearities and parameter uncertainties, and traditional methods are often slow and costly. To address this, we propose a novel autonomous design framework driven by Large Language Models (LLMs). Our approach employs a bi-level optimization strategy: an LLM intelligently explores and iteratively improves the control algorithm's structure, while a Particle Swarm Optimization (PSO) algorithm efficiently refines the parameters for any given structure. This method achieves end-to-end automated design. Validated through a simulation of a DC-DC Boost converter, our framework successfully evolved a basic controller into a high-performance adaptive version that met all stringent design specifications for fast response, low error, and robustness. This work presents a new paradigm for control design that significantly enhances automation and efficiency.
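The lower level of the bi-level strategy, PSO over controller parameters for a fixed structure, is a standard component. Below is a compact, hedged sketch (Python) of PSO refining a proportional-integral gain pair on a toy first-order plant with an integral-of-absolute-error cost; the plant, bounds, and cost are illustrative assumptions, not the paper's Boost-converter setup.

```python
import numpy as np

def closed_loop_cost(gains, T=200, dt=0.01):
    """Integral-of-absolute-error of a PI controller on a toy plant dx/dt = -x + u."""
    kp, ki = gains
    x, integ, cost = 0.0, 0.0, 0.0
    for _ in range(T):
        err = 1.0 - x                  # unit step reference
        integ += err * dt
        u = kp * err + ki * integ
        x += dt * (-x + u)
        cost += abs(err) * dt
    return cost

rng = np.random.default_rng(0)
n_particles, dim = 20, 2
pos = rng.uniform(0.0, 10.0, (n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_cost = np.array([closed_loop_cost(p) for p in pos])
gbest = pbest[np.argmin(pbest_cost)]

for _ in range(50):
    r1, r2 = rng.random((2, n_particles, dim))
    # Inertia + cognitive + social terms (standard PSO update).
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 10.0)
    cost = np.array([closed_loop_cost(p) for p in pos])
    improved = cost < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], cost[improved]
    gbest = pbest[np.argmin(pbest_cost)]

print("best (kp, ki):", gbest, "| cost:", pbest_cost.min())
```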
Authors: Julius P.J. Krebbekx, Roland Tóth, Amritam Das
In this technical communique, we develop a graphical design procedure for reset controllers for unstable LTI plants based on recent developments in Scaled Relative Graph analysis, yielding an $L_2$-gain performance bound. The stabilizing controller consists of a second-order reset element in parallel with a proportional gain. The proposed method goes beyond existing approaches that are limited to stable systems only, providing a practically applicable approach to design problems where the plant is unstable.
Authors: Saeed Razavikia, Carlo Fischione
Over-the-air computation (OAC) leverages the physical superposition property of wireless multiple access channels (MACs) to compute functions while communication occurs, enabling scalable and low-latency processing in distributed networks. While analog OAC methods suffer from noise sensitivity and hardware constraints, existing digital approaches are often limited in design complexity, which may hinder scalability and fail to exploit spectral efficiency fully. This two-part paper revisits and extends the ChannelComp framework, a general methodology for computing arbitrary finite-valued functions using digital modulation. In Part I, we develop a novel constellation design approach that is aware of the noise distribution and formulates the encoder design as a max-min optimization problem using noise-tailored distance metrics. Our design supports noise models, including Gaussian, Laplace, and heavy-tailed distributions. We further demonstrate that, for heavy-tailed noise, the optimal ChannelComp setup coincides with the solution to the corresponding max-min criterion for the channel noise with heavy-tailed distributions. Numerical experiments confirm that our noise-aware design achieves a substantially lower mean-square error than leading digital OAC methods over noisy MACs. In Part II, we consider a constellation design with a quantization-based sampling scheme to enhance modulation scalability and computational accuracy for large-scale digital OAC.
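To make the max-min design criterion concrete, the brute-force sketch below (Python, tiny toy instance with assumed parameters) evaluates a candidate one-dimensional constellation for K = 2 nodes computing f = sum: every pair of superimposed points whose function values differ must be separable, and the design objective is the worst-case such distance. The actual framework solves this as an optimization with noise-tailored metrics; this snippet only enumerates the criterion.

```python
import itertools
import numpy as np

# Toy digital OAC instance: K = 2 nodes, q = 3 input levels, f = sum.
levels = [0, 1, 2]
encoder = {0: -1.0, 1: 0.0, 2: 1.2}   # candidate constellation (assumed values)
f = lambda msgs: sum(msgs)

worst = np.inf
for a, b in itertools.combinations(itertools.product(levels, repeat=2), 2):
    ya = sum(encoder[m] for m in a)    # superimposed point for input tuple a
    yb = sum(encoder[m] for m in b)
    if f(a) != f(b):                   # distinct outputs must remain separable
        worst = min(worst, abs(ya - yb))

# The max-min design maximizes 'worst' over encoders; here we only evaluate it.
print("min distance between points with different function values:", worst)
```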
Authors: Saeed Razavikia, Carlo Fischione
Over-the-air computation (OAC) harnesses the natural superposition of wireless signals to compute aggregate functions during transmission, thereby collapsing communication and computation into a single step and significantly reducing latency and resource usage. In Part I, digital OAC was formulated as a noise-aware constellation design problem by casting encoder design as a max-min optimization that aligns minimum Euclidean distances between superimposed constellation points with squared differences of their corresponding function outputs. In this paper, Part II, we address the prohibitive complexity and quantization challenges inherent in digital OAC constellation design for large-scale edge networks. More precisely, we introduce a pyramid sampling strategy that judiciously selects a subset of superimposed constellation points to reduce the encoder design complexity from $\mathcal{O}(q^K)$ to $\mathcal{O}(q^{K-p+1})$, where $p\in\{1,\dots, K\}$ denotes the sampling order, $q$ the number of modulation levels, and $K$ the number of nodes in the network. Under the assumption of symmetric aggregation, this approach enables a controlled trade-off between computational complexity and function computation accuracy. As a special case, we propose majority-based sampling ($p=K$), which confines aggregation to only $q$ consensus points, inherently avoiding destructive overlaps and permitting the use of standard digital modulations (e.g., QAM, PSK, ASK) without bespoke constellation designs. We also show via several simulations, across various aggregation functions, modulation levels, and noise levels, that moderate sampling orders attain acceptable performance with orders-of-magnitude fewer constraints than exhaustive designs.
Authors: Hossein Mohammadi Firouzjaei, Rafaela Scaciota, Sumudu Samarakoon
Enhancing the sustainability and efficiency of wireless sensor networks (WSNs) in dynamic and unpredictable environments requires adaptive communication and energy harvesting (EH) strategies. We propose a novel adaptive control strategy for WSNs that optimizes data transmission and EH to minimize overall energy consumption while ensuring queue stability and energy storage constraints under dynamic environmental conditions. The notion of adaptability therein is achieved by transferring known environment-specific knowledge to new conditions, resorting to lifelong reinforcement learning concepts. We evaluate our proposed method against two baseline frameworks: Lyapunov-based optimization and policy-gradient reinforcement learning (RL). Simulation results demonstrate that our approach rapidly adapts to changing environmental conditions by leveraging transferable knowledge, achieving near-optimal performance approximately $30\%$ faster than the RL method and $60\%$ faster than the Lyapunov-based approach. The implementation is available at our GitHub repository for reproducibility purposes [1].
Authors: Fang Chen, Weifeng Zhang, Xingyu Ai, BingXuan Li, An Li, Qiegen Liu
Positron emission tomography (PET) is widely used to assess metabolic activity, but its application is limited by the availability of radiotracers. 18F-labeled fluorodeoxyglucose (18F-FDG) is the most commonly used tracer but shows limited effectiveness for certain tumors. In contrast, 6-18F-fluoro-3,4-dihydroxy-L-phenylalanine (18F-DOPA) offers higher specificity for neuroendocrine tumors and neurological disorders. However, the complexity of its synthesis process and constraints on transportation time have limited its clinical application. Among the different forms of raw data acquired by the scanner, the sinogram is a commonly used representation in PET imaging. Therefore, modeling in the projection domain enables more direct utilization of the original information, potentially reducing the accumulation of errors during the image reconstruction process. Inspired by these factors, this study proposes a prior-guided joint diffusion model (PJDM) for transforming 18F-FDG PET sinograms into 18F-DOPA PET sinograms. During inference, an initial synthetic 18F-DOPA PET sinogram is first generated using a higher-order hybrid sampler. This sinogram is then degraded and serves as an additional condition to guide the iterative refinement process. Experimental results demonstrated that PJDM effectively improved both sinogram quality and the final synthetic outcomes. The code is available at: this https URL.
Authors: Ghalib Ahmed Tahir, Chu Kiong Loo
Automatic food detection is an emerging topic of interest due to its wide array of applications, ranging from detecting food images on social media platforms to filtering non-food photos from users in dietary assessment apps. Recently, during the COVID-19 pandemic, it has facilitated enforcing eating bans by automatically detecting eating activities from cameras in public places. Therefore, to tackle the challenge of recognizing food images with high accuracy, we propose a hybrid framework for extracting and selecting optimal features from an efficient neural network. A nonlinear classifier is then employed to discriminate between linearly inseparable feature vectors with great precision. In line with this idea, our method extracts features from MobileNetV3, selects an optimal subset of attributes using Shapley Additive exPlanations (SHAP) values, and exploits the kernel extreme learning machine (KELM) due to its nonlinear decision boundary and good generalization ability. However, KELM suffers from the curse-of-dimensionality problem on large datasets due to the complex computation of the kernel matrix with large numbers of hidden nodes. We solve this problem by proposing a novel multicolumn kernel extreme learning machine (MCKELM), which exploits the k-d tree algorithm to divide the data into N subsets and trains a separate KELM on each subset. The method then incorporates the KELM classifiers into a parallel structure and, during testing, selects the top k nearest subsets via k-d tree search, classifying the input with these subsets instead of the whole network. To evaluate the proposed framework, a large food/non-food dataset is prepared using nine publicly available datasets. Experimental results show the superiority of our method on an integrated set of measures while solving the curse-of-dimensionality problem in KELM for large datasets.
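A minimal sketch of the KELM building block used in each column is shown below, assuming an RBF kernel and one-hot targets on synthetic data; the single O(n^3) closed-form solve is what becomes expensive as the kernel matrix grows, motivating the k-d-tree partitioning. MCKELM would replace this single solve with one per k-d-tree subset and route test points to the top-k nearest subsets.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kelm_train(X, y_onehot, C=10.0):
    """Kernel ELM: beta = (K + I/C)^{-1} T  (the O(n^3) scalability bottleneck)."""
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + np.eye(len(X)) / C, y_onehot)

def kelm_predict(X_train, beta, X_test):
    return rbf_kernel(X_test, X_train) @ beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))        # toy features
y = (X[:, 0] > 0).astype(int)
T = np.eye(2)[y]                         # one-hot targets
beta = kelm_train(X, T)
pred = kelm_predict(X, beta, X).argmax(1)
print("train accuracy:", (pred == y).mean())
```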
Authors: Johannes Köhler, Melanie N. Zeilinger
To address feasibility issues in model predictive control (MPC), most implementations relax state constraints by using slack variables and adding a penalty to the cost. We propose an alternative strategy: relaxing the initial state constraint with a penalty. Compared to state-of-the-art soft constrained MPC formulations, the proposed formulation has two key features: (i) input-to-state stability and bounds on the cumulative constraint violation for large disturbances; (ii) close-to-optimal performance under nominal operating conditions. The idea is initially presented for open-loop asymptotically stable nonlinear systems by designing the penalty as a Lyapunov function, but we also show how to relax this condition to: i) Lyapunov stable systems; ii) stabilizable systems; and iii) utilizing an implicit characterization of the Lyapunov function. In the special case of linear systems, the proposed MPC formulation reduces to a quadratic program, and the offline design and online computational complexity are only marginally increased compared to a nominal design. Numerical examples demonstrate benefits compared to state-of-the-art soft-constrained MPC formulations.
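A minimal sketch of the proposed idea for a linear system is given below using cvxpy, with toy dynamics and a quadratic penalty on the initial-state relaxation (weights, horizon, and constraints are illustrative assumptions). State and input constraints stay hard; only the constraint x_0 = x_measured is softened, which keeps the QP feasible even when the measured state violates the state constraints.

```python
import cvxpy as cp
import numpy as np

# Toy stable linear system and horizon (illustrative).
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
N = 20
x_meas = np.array([3.5, 0.0])           # measured state violating |x1| <= 3

x = cp.Variable((2, N + 1))
u = cp.Variable((1, N))
mu = 100.0                               # penalty on relaxing the initial state

cost = mu * cp.sum_squares(x[:, 0] - x_meas)   # relaxed x0 = x_meas constraint
constr = []
for k in range(N):
    cost += cp.sum_squares(x[:, k]) + 0.1 * cp.sum_squares(u[:, k])
    constr += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k]]
    constr += [cp.abs(x[0, k]) <= 3.0, cp.abs(u[:, k]) <= 1.0]  # hard constraints

prob = cp.Problem(cp.Minimize(cost), constr)
prob.solve()
print("status:", prob.status, "| x0 used:", x[:, 0].value)
```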
Authors: Dominik Wagner, Ilja Baumann, Tobias Bocklet
Cycle-consistent generative adversarial networks have been widely used in non-parallel voice conversion (VC). Their ability to learn mappings between source and target features without relying on parallel training data eliminates the need for temporal alignments. However, most methods decouple the conversion of acoustic features from synthesizing the audio signal by using separate models for conversion and waveform synthesis. This work unifies conversion and synthesis into a single model, thereby eliminating the need for a separate vocoder. By leveraging cycle-consistent training and a self-supervised auxiliary training task, our model is able to efficiently generate converted high-quality raw audio waveforms. Subjective listening tests showed that our unified approach achieved improvements of up to 6.7% relative to the baseline in whispered VC. Mean opinion score predictions also yielded stable results in conventional VC (between 0.5% and 2.4% relative improvement).
Authors: Yuto Watanabe, Kazunori Sakurama
This study explores distributed optimization problems with clique-wise coupling via operator splitting and how we can utilize this framework for performance analysis and enhancement. This framework extends beyond conventional pairwise coupled problems (e.g., consensus optimization) and is applicable to broader examples. To this end, we first introduce a new distributed optimization algorithm by leveraging a clique-based matrix and the Davis-Yin splitting (DYS), a versatile three-operator splitting method. We then demonstrate that this approach sheds new light on conventional algorithms in the following way: (i) Existing algorithms (NIDS, Exact diffusion, diffusion, and our previous work) can be derived from our proposed method; (ii) We present a new mixing matrix based on clique-wise coupling, which surfaces when deriving the NIDS. We prove its preferable distribution of eigenvalues, enabling fast consensus; (iii) These observations yield a new linear convergence rate for the NIDS with non-smooth objective functions. Remarkably, our linear rate is the first established for the general DYS with a projection onto a subspace; to our knowledge, this case is not covered by any prior results. Finally, numerical examples showcase the efficacy of our proposed approach.
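For reference, the plain Davis-Yin iteration for min f(x) + g(x) + h(x), with f and g prox-friendly and h smooth, looks as follows. This toy sketch (a lasso problem with a box constraint, step size chosen from the data) is only the centralized template that the clique-based distributed algorithm builds on.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((40, 20)), rng.standard_normal(40)
lam = 0.1                                            # l1 weight
gamma = 1.0 / np.linalg.norm(A, 2) ** 2              # step < 2 / Lipschitz(grad h)

grad_h = lambda x: A.T @ (A @ x - b)                 # h = 0.5 ||Ax - b||^2 (smooth)
prox_f = lambda v: np.clip(v, -1.0, 1.0)             # f = indicator of box [-1, 1]^n
prox_g = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)  # g = lam ||x||_1

z = np.zeros(20)
for _ in range(1000):
    xg = prox_g(z)                                   # resolvent of g
    xf = prox_f(2 * xg - z - gamma * grad_h(xg))     # resolvent of f after gradient step on h
    z = z + (xf - xg)                                # correction (relaxation parameter 1)

print("objective:", 0.5 * np.linalg.norm(A @ xg - b) ** 2 + lam * np.abs(xg).sum())
```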
Authors: Hannah J. Smith, Blanca Rodriguez, Yuling Sang, Marcel Beetz, Robin Choudhury, Vicente Grau, Abhirup Banerjee
The electrocardiogram (ECG) is used for diagnosis and risk stratification following myocardial infarction (MI). Women have a higher incidence of missed MI diagnosis and complications following infarction, and to address this we aim to provide quantitative information on sex differences in ECG and torso-ventricular anatomy features. A novel computational automated pipeline is presented, enabling the three-dimensional reconstruction of torso-ventricular anatomies for 425 post-MI subjects and 1051 healthy controls from UK Biobank clinical images. Regression models were created relating torso-ventricular and ECG parameters. For post-MI women, the heart is positioned more posteriorly and vertically than in men (with healthy women yet more vertical). Post-MI women exhibit less QRS prolongation, requiring 27% more prolongation than men to exceed 120 ms. Only half of the sex difference in QRS is associated with smaller female cavities. Lower STj amplitude in women is striking, associated with smaller ventricles, but also a more superior and posterior cardiac position. Post-MI, T wave amplitude and R axis deviations are strongly associated with a more posterior and horizontal cardiac position in women (but not in men). Our study highlights the need to quantify sex differences in anatomical features, their implications for ECG interpretation, and the application of clinical ECG thresholds post-MI.
Authors: Eduardo Sebastian, Thai Duong, Nikolay Atanasov, Eduardo Montijano, Carlos Sagues
The networked nature of multi-robot systems presents challenges in the context of multi-agent reinforcement learning. Centralized control policies do not scale with increasing numbers of robots, whereas independent control policies do not exploit the information provided by other robots, exhibiting poor performance in cooperative-competitive tasks. In this work we propose a physics-informed reinforcement learning approach able to learn distributed multi-robot control policies that are both scalable and make use of all the information available to each robot. Our approach has three key characteristics. First, it imposes a port-Hamiltonian structure on the policy representation, respecting energy conservation properties of physical robot systems and the networked nature of robot team interactions. Second, it uses self-attention to ensure a sparse policy representation able to handle time-varying information at each robot from the interaction graph. Third, we present a soft actor-critic reinforcement learning algorithm parameterized by our self-attention port-Hamiltonian control policy, which accounts for the correlation among robots during training while overcoming the need for value function factorization. Extensive simulations in different multi-robot scenarios demonstrate the success of the proposed approach, surpassing previous multi-robot reinforcement learning solutions in scalability, while achieving similar or superior performance (with average cumulative reward up to 2x greater than the state-of-the-art, with robot teams 6x larger than the number of robots at training time). We also validate our approach on multiple real robots in the Georgia Tech Robotarium under imperfect communication, demonstrating zero-shot sim-to-real transfer and scalability across the number of robots.
Authors: Kai Zhang, Xuanyu Cao, Khaled B. Letaief
Federated learning (FL) necessitates that edge devices conduct local training and communicate with a parameter server, resulting in significant energy consumption. A key challenge in practical FL systems is the rapid depletion of battery-limited edge devices, which limits their operational lifespan and impacts learning performance. To tackle this issue, we implement energy harvesting techniques in FL systems to capture ambient energy, thereby providing continuous power to edge devices. We first establish the convergence bound for the wireless FL system with energy harvesting devices, illustrating that the convergence is affected by partial device participation and packet drops, both of which depend on the energy supply. To accelerate the convergence, we formulate a joint device scheduling and power control problem and model it as a Markov decision process (MDP). By solving this MDP, we derive the optimal transmission policy and demonstrate that it possesses a monotone structure with respect to the battery and channel states. To overcome the curse of dimensionality caused by the exponential complexity of computing the optimal policy, we propose a low-complexity algorithm, which is asymptotically optimal as the number of devices increases. Furthermore, for unknown channels and harvested energy statistics, we develop a structure-enhanced deep reinforcement learning algorithm that leverages the monotone structure of the optimal policy to improve the training performance. Finally, extensive numerical experiments on real-world datasets are presented to validate the theoretical results and corroborate the effectiveness of the proposed algorithms.
Authors: John M McBride, Nahie Kim, Yuri Nishikawa, Mekhmed Saadakeev, Marcus T Pearce, Tsvi Tlusty
The number of possible melodies is unfathomably large, yet despite this virtually unlimited potential for melodic variation, melodies from different societies can be surprisingly similar. The motor constraint hypothesis accounts for certain similarities, such as scalar motion and contour shape, but not for other major common features, such as repetition, song length, and scale size. Here we investigate the role of information constraints in shaping these hallmarks of melodies. We measure determinants of information rate in 62 corpora of Folk melodies spanning several continents, finding multiple trade-offs that all act to constrain the information rate across societies. By contrast, 39 corpora of Art music from Europe (including Turkey) show longer, more complex melodies, and increased complexity over time, suggesting different cultural-evolutionary selection pressures in Art and Folk music, possibly due to the use of written versus oral transmission. Our parameter-free model predicts the empirical scale degree distribution using information constraints on scalar motion, melody length, and, most importantly, information rate. These results provide strong evidence that information constraints during cultural transmission of music limit the number of notes in a scale, and suggest that a tendency for intermediate melodic complexity reflects a fundamental constraint on the cultural evolution of melody.
Authors: Jinbo Hou, Kehai Qiu, Zitian Zhang, Yong Yu, Kezhi Wang, Stefano Capolongo, Jiliang Zhang, Zeyang Li, Jie Zhang
This paper aims to simultaneously optimize indoor wireless and daylight performance by adjusting the positions of windows and the beam directions of window-deployed reconfigurable intelligent surfaces (RISs) for RIS-aided outdoor-to-indoor (O2I) networks, utilizing large language models (LLMs) as optimizers. Firstly, we illustrate the wireless and daylight system models of RIS-aided O2I networks and formulate a joint optimization problem to enhance both wireless traffic sum rate and daylight illumination performance. Then, we present a multi-modal LLM-based window optimization (LMWO) framework, accompanied by a prompt construction template to optimize the overall performance in a zero-shot fashion, functioning as both an architect and a wireless network planner. Finally, we analyze the optimization performance of the LMWO framework and the impact of the number of windows, room size, number of RIS units, and daylight factor. Numerical results demonstrate that our proposed LMWO framework can achieve outstanding optimization performance in terms of initial performance, convergence speed, final outcomes, and time complexity, compared with classic optimization methods. The building's wireless performance can be significantly enhanced while ensuring indoor daylight performance.
Authors: Hadi Mehdizavareh, Arijit Khan, Simon Lebech Cichosz
Accurately predicting blood glucose (BG) levels of ICU patients is critical, as both hypoglycemia (BG < 70 mg/dL) and hyperglycemia (BG > 180 mg/dL) are associated with increased morbidity and mortality. This study presents a proof-of-concept machine learning framework, the Multi-source Irregular Time-Series Transformer (MITST), designed to predict BG levels in ICU patients. In contrast to existing methods that rely heavily on manual feature engineering or utilize limited Electronic Health Record (EHR) data sources, MITST integrates diverse clinical data -- including laboratory results, medications, and vital signs -- without predefined aggregation. The model leverages a hierarchical Transformer architecture, designed to capture interactions among features within individual timestamps, temporal dependencies across different timestamps, and semantic relationships across multiple data sources. Evaluated using the extensive eICU database (200,859 ICU stays across 208 hospitals), MITST achieves a statistically significant (p < 0.001) average improvement of 1.7 percentage points (pp) in AUROC and 1.8 pp in AUPRC over a state-of-the-art random forest baseline. Crucially, for hypoglycemia -- a rare but life-threatening condition -- MITST increases sensitivity by 7.2 pp, potentially enabling hundreds of earlier interventions across ICU populations. The flexible architecture of MITST allows seamless integration of new data sources without retraining the entire model, enhancing its adaptability for clinical decision support. While this study focuses on predicting BG levels, we also demonstrate MITST's ability to generalize to a distinct clinical task (in-hospital mortality prediction), highlighting its potential for broader applicability in ICU settings. MITST thus offers a robust and extensible solution for analyzing complex, multi-source, irregular time-series data.
Authors: Tengjie Zheng, Haipeng Chen, Lin Cheng, Shengping Gong, Xu Huang
Learning dynamical models from data is not only fundamental but also holds great promise for advancing principle discovery, time-series prediction, and controller design. Among various approaches, Gaussian Process State-Space Models (GPSSMs) have recently gained significant attention due to their combination of flexibility and interpretability. However, for online learning, the field lacks an efficient method suitable for scenarios where prior information regarding data distribution and model function is limited. To address this issue, this paper proposes a recursive GPSSM method with adaptive capabilities for both operating domains and Gaussian process (GP) hyperparameters. Specifically, we first utilize first-order linearization to derive a Bayesian update equation for the joint distribution between the system state and the GP model, enabling closed-form and domain-independent learning. Second, an online selection algorithm for inducing points is developed based on informative criteria to achieve lightweight learning. Third, to support online hyperparameter optimization, we recover historical measurement information from the current filtering distribution. Comprehensive evaluations on both synthetic and real-world datasets demonstrate the superior accuracy, computational efficiency, and adaptability of our method compared to state-of-the-art online GPSSM techniques.
Authors: Tianxing Chen, Yao Mu, Zhixuan Liang, Zanxin Chen, Shijia Peng, Qiangyu Chen, Mingkun Xu, Ruizhen Hu, Hongyuan Zhang, Xuelong Li, Ping Luo
Recent advances in imitation learning for 3D robotic manipulation have shown promising results with diffusion-based policies. However, achieving human-level dexterity requires seamless integration of geometric precision and semantic understanding. We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. This integration enables complete semantic understanding even under occlusions while eliminating manual annotation requirements. By incorporating semantic flow into diffusion policies, we demonstrate significant improvements in both terminal-constrained manipulation and cross-object generalization. Extensive experiments across five simulation tasks show that G3Flow consistently outperforms existing approaches, achieving up to 68.3% and 50.1% average success rates on terminal-constrained manipulation and cross-object generalization tasks respectively. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.
Authors: Sho Inoue, Kun Zhou, Shuai Wang, Haizhou Li
Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a flow-matching based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. The TTS model not only generates emotional speech during inference, but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.
Authors: Fabian Jakob, Andrea Iannelli
In this paper we propose a framework to analyze iterative first-order optimization algorithms for time-varying convex optimization. We assume that the temporal variability is caused by a time-varying parameter entering the objective, which can be measured at the time of decision but whose future values are unknown. We consider the case of strongly convex objective functions with Lipschitz continuous gradients under a convex constraint set. We model the algorithms as discrete-time linear parameter varying (LPV) systems in feedback with monotone operators such as the time-varying gradient. We leverage the approach of analyzing algorithms as uncertain control interconnections with integral quadratic constraints (IQCs) and generalize that framework to the time-varying case. We propose novel IQCs that are capable of capturing the behavior of time-varying nonlinearities and leverage techniques from the LPV literature to establish novel bounds on the tracking error. Quantitative bounds can be computed by solving a semi-definite program and can be interpreted as an input-to-state stability result with respect to a disturbance signal which increases with the temporal variability of the problem. As a departure from results in this research area, our bounds introduce a dependence on different additional measures of temporal variations, such as the function value and gradient rate of change. We exemplify our main results with numerical experiments that showcase how our analysis framework is able to capture convergence rates of different first-order algorithms for time-varying optimization through the choice of IQC and rate bounds.
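The flavor of this certification procedure can be seen in the classical static case: for gradient descent on an m-strongly convex, L-smooth function, one searches for a quadratic Lyapunov certificate subject to a sector IQC on the gradient and bisects on the convergence rate. The sketch below (cvxpy, scalar Lyapunov variable normalized to one) reproduces that textbook analysis; the paper's LPV and time-varying machinery generalizes far beyond this toy version.

```python
import cvxpy as cp
import numpy as np

m, L = 1.0, 10.0                 # strong convexity and smoothness (assumed)
alpha = 1.0 / L                  # gradient step size
A, B = 1.0, -alpha               # x+ = A x + B u,  u = grad f(x)

# Sector IQC for the gradient: [x; u]^T M [x; u] >= 0 when m <= f'' <= L.
M = np.array([[-m * L, (m + L) / 2],
              [(m + L) / 2, -1.0]])

def certifies(rho: float) -> bool:
    """Feasibility of the rate-rho LMI with Lyapunov variable fixed to P = 1."""
    lam = cp.Variable(nonneg=True)
    base = np.array([[A * A - rho ** 2, A * B],
                     [A * B, B * B]])
    prob = cp.Problem(cp.Minimize(0), [cp.Constant(base) + lam * M << 0])
    prob.solve(solver=cp.SCS)
    return prob.status in ("optimal", "optimal_inaccurate")

lo, hi = 0.0, 1.0                # bisection on the convergence rate
for _ in range(30):
    mid = (lo + hi) / 2
    lo, hi = (lo, mid) if certifies(mid) else (mid, hi)
print("certified rate:", hi, "| known rate 1 - m/L =", 1 - m / L)
```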
Authors: Junan Zhang, Jing Yang, Zihao Fang, Yuancheng Wang, Zehua Zhang, Zhuo Wang, Fan Fan, Zhizheng Wu
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at this https URL.
Authors: Xiaofeng Lin, Enduo Zhao, Saúl Alexis Heredia Pérez, Kanako Harada
Estimating the state of biological specimens is challenging due to limited observation through microscopic vision. For instance, during mouse skull drilling, the appearance changes little when thinning bone tissue because of its semi-transparent property and the high-magnification microscopic vision. To obtain the object's state, we introduce an object state estimation method for biological specimens through active interaction based on deflection. The method is integrated to enhance the autonomous drilling system developed in our previous work. The method and integrated system were evaluated through 12 autonomous eggshell drilling experiment trials. The results show that the system achieved a 91.7% success ratio and a 75% detachable ratio, showcasing its potential applicability in more complex surgical procedures such as mouse skull craniotomy. This research paves the way for further development of autonomous robotic systems capable of estimating the object's state through active interaction.
Authors: Qi Mao, Haobo Hu, Yujie He, Difei Gao, Haokun Chen, Libiao Jin
Affective Image Manipulation (AIM) aims to alter visual elements within an image to evoke specific emotional responses from viewers. However, existing AIM approaches rely on rigid \emph{one-to-one} mappings between emotions and visual cues, making them ill-suited for the inherently subjective and diverse ways in which humans perceive and express emotions. To address this, we introduce a novel task setting termed \emph{Diverse AIM (D-AIM)}, aiming to generate multiple visually distinct yet emotionally consistent image edits from a single source image and target emotion. We propose \emph{EmoAgent}, the first multi-agent framework tailored specifically for D-AIM. EmoAgent explicitly decomposes the manipulation process into three specialized phases executed by collaborative agents: a Planning Agent that generates diverse emotional editing strategies, an Editing Agent that precisely executes these strategies, and a Critic Agent that iteratively refines the results to ensure emotional accuracy. This collaborative design empowers EmoAgent to model \emph{one-to-many} emotion-to-visual mappings, enabling semantically diverse and emotionally faithful edits. Extensive quantitative and qualitative evaluations demonstrate that EmoAgent substantially outperforms state-of-the-art approaches in both emotional fidelity and semantic diversity, effectively generating multiple distinct visual edits that convey the same target emotion.
Authors: Berken Utku Demirel, Adnan Harun Dogan, Juliete Rossie, Max Moebus, Christian Holz
Virtual reality (VR) presents immersive opportunities across many applications, yet the inherent risk of developing cybersickness during interaction can severely reduce enjoyment and platform adoption. Cybersickness is marked by symptoms such as dizziness and nausea, which previous work primarily assessed via subjective post-immersion questionnaires and motion-restricted controlled setups. In this paper, we investigate the \emph{dynamic nature} of cybersickness while users experience and freely interact in VR. We propose a novel method to \emph{continuously} identify and quantitatively gauge cybersickness levels from users' \emph{passively monitored} electroencephalography (EEG) and head motion signals. Our method estimates multitaper spectrums from EEG, integrating specialized EEG processing techniques to counter motion artifacts, and, thus, tracks cybersickness levels in real-time. Unlike previous approaches, our method requires no user-specific calibration or personalization for detecting cybersickness. Our work addresses the considerable challenge of reproducibility and subjectivity in cybersickness research.
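Multitaper spectral estimation, which the method builds on, averages periodograms computed with orthogonal DPSS (Slepian) tapers to reduce variance. A minimal sketch on a synthetic signal follows, using scipy's dpss window function with hypothetical parameters; it is independent of the paper's motion-artifact handling.

```python
import numpy as np
from scipy.signal.windows import dpss

fs, n = 256.0, 1024                          # sampling rate and window length (assumed)
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.default_rng(0).standard_normal(n)

NW, K = 4, 7                                 # time-bandwidth product and taper count
tapers = dpss(n, NW, Kmax=K)                 # K orthogonal Slepian tapers, shape (K, n)

# Average the K tapered periodograms -> low-variance multitaper spectrum.
psd = np.mean(np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2, axis=0) / fs
freqs = np.fft.rfftfreq(n, 1 / fs)
print("peak frequency: %.2f Hz" % freqs[np.argmax(psd)])   # ~10 Hz
```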
Authors: Dim Shaiakhmetov, Gulnaz Gimaletdinova, Kadyrmamat Momunov, Selcuk Cankurt
Proper recitation of the Quran, adhering to the rules of Tajweed, is crucial for preventing mistakes during recitation and requires significant effort to master. Traditional methods of teaching these rules are limited by the availability of qualified instructors and time constraints. Automatic evaluation of recitation can address these challenges by providing prompt feedback and supporting independent practice. This study focuses on developing a deep learning model to classify three Tajweed rules - separate stretching (Al Mad), tight noon (Ghunnah), and hide (Ikhfaa) - using the publicly available QDAT dataset, which contains over 1,500 audio recordings. The input data consisted of audio recordings from this dataset, transformed into normalized mel-spectrograms. For classification, the EfficientNet-B0 architecture was used, enhanced with a Squeeze-and-Excitation attention mechanism. The developed model achieved accuracy rates of 95.35%, 99.34%, and 97.01% for the respective rules. An analysis of the learning curves confirmed the model's robustness and absence of overfitting. The proposed approach demonstrates high efficiency and paves the way for developing interactive educational systems for Tajweed study.
Authors: Jie Chen, Xianbin Wang, Dusit Niyato
Due to the growing complexity of vertical applications, current integrated sensing and communications (ISAC) in wireless networks remains insufficient for supporting all required beyond communication services. To this end, future networks are evolving toward an integrated heterogeneous service provisioning (IHSP) platform, which seeks to integrate a broad range of heterogeneous services beyond the dual-function scope of ISAC. Nevertheless, this trend intensifies conflicts among concurrent heterogeneous service requirements under constrained resource sharing. In this paper, we overcome this challenge by the joint use of two novel elastic design strategies: compromised service value assessment and flexible multi-dimensional resource multiplexing. Consequently, we propose a value-prioritized elastic multi-dimensional multiple access (MDMA) mechanism for IHSP systems. First, we modify the Value-of-Service (VoS) metric by incorporating elastic parameters to characterize user-specific tolerance and compromise in response to various performance degradations under constrained resources. This VoS metric serves as the foundation for prioritizing services and enabling effective fairness service scheduling among concurrent competing demands. Next, we adapt the MDMA to elastically multiplex services using appropriate multiple access schemes across different resource domains. This protocol leverages user-specific interference tolerances and cancellation capabilities across different domains to reduce resource-demanding conflicts and co-channel interference within the same domain. Then, we maximize the system's VoS by jointly optimizing MDMA and power allocation. Since this problem is non-convex, we develop a monotonic optimization-assisted dynamic programming algorithm for the optimal solution and a VoS-prioritized successive convex approximation algorithm for efficient suboptimal computation.
Authors: Sanjit Dandapanthula, Aaditya Ramdas
Changepoint localization is the problem of estimating the index at which a change occurred in the data generating distribution of an ordered list of data, or declaring that no change occurred. We present the broadly applicable CONCH (CONformal CHangepoint localization) algorithm, which uses a matrix of conformal p-values to produce a confidence interval for a (single) changepoint under the mild assumption that the pre-change and post-change distributions are each exchangeable. We exemplify the CONCH algorithm on a variety of synthetic and real-world datasets, including using black-box pre-trained classifiers to detect changes in sequences of images or text.
Authors: Zhengyan Sheng, Jinghao He, Liping Chen, Kong Aik Lee, Zhen-Hua Ling
Voice timbre refers to the unique quality or character of a person's voice that distinguishes it from others as perceived by human hearing. The Voice Timbre Attribute Detection (VtaD) 2025 challenge focuses on explaining the voice timbre attribute in a comparative manner. In this challenge, the human impression of voice timbre is verbalized with a set of sensory descriptors, including bright, coarse, soft, magnetic, and so on. The timbre is explained from the comparison between two voices in their intensity within a specific descriptor dimension. The VtaD 2025 challenge starts in May and culminates in a special session at the NCMMSC2025 conference in October 2025 in Zhenjiang, China.
Authors: Jinghao He, Zhengyan Sheng, Liping Chen, Kong Aik Lee, Zhen-Hua Ling
This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensity is compared in a designated timbre descriptor. Moreover, a framework is proposed, which is built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experimental examinations on the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training, indicating enhanced generalization capability. The VCTK-RVA dataset and open-source code are available on the website this https URL.
Authors: Paula Feldman, Martin Sinnona, Claudio Delrieux, Viviana Siless, Emmanuel Iarussi
Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometry makes accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in previous methods' parameterizations. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code is available at this https URL.
Authors: Mohamad Mestoukirdi, Mourad Khanfouci
This work proposes a new algorithm to mitigate model generalization loss in Vertical Federated Learning (VFL) operating under client reliability constraints within 5G Core Networks (CNs). Recently studied and endorsed by 3GPP, VFL enables collaborative and load-balanced model training and inference across the CN. However, the performance of VFL significantly degrades when the Network Data Analytics Functions (NWDAFs) - which serve as primary clients for VFL model training and inference - experience reliability issues stemming from resource constraints and operational overhead. Unlike edge environments, CN environments adopt fundamentally different data management strategies, characterized by more centralized data orchestration capabilities. This presents opportunities to implement better distributed solutions that take full advantage of the CN data handling flexibility. Leveraging this flexibility, we propose a method that optimizes the vertical feature split among clients while centrally defining their local models based on reliability metrics. Our empirical evaluation demonstrates the effectiveness of our proposed algorithm, showing improved performance over traditional baseline methods.
Authors: Longjie Luo, Shenghui Lu, Lin Li, Qingyang Hong
This paper presents our system for the MISP-Meeting Challenge Track 2. The primary difficulty lies in the dataset, which contains strong background noise, reverberation, overlapping speech, and diverse meeting topics. To address these issues, we (a) designed G-SpatialNet, a speech enhancement (SE) model to improve Guided Source Separation (GSS) signals; (b) proposed TLS, a framework comprising time alignment, level alignment, and signal-to-noise ratio filtering, to generate signal-level pseudo labels for real-recorded far-field audio data, thereby facilitating SE models' training; and (c) explored fine-tuning strategies, data augmentation, and multimodal information to enhance the performance of pre-trained Automatic Speech Recognition (ASR) models in meeting scenarios. Finally, our system achieved character error rates (CERs) of 5.44% and 9.52% on the Dev and Eval sets, respectively, with relative improvements of 64.8% and 52.6% over the baseline, securing second place.
Authors: Longjie Luo, Lin Li, Qingyang Hong
Due to the lack of target speech annotations in real-recorded far-field conversational datasets, speech enhancement (SE) models are typically trained on simulated data. However, the trained models often perform poorly in real-world conditions, hindering their application in far-field speech recognition. To address the issue, we (a) propose direct sound estimation (DSE) to estimate the oracle direct sound of real-recorded data for SE; and (b) present a novel pseudo-supervised learning method, SuPseudo, which leverages DSE-estimates as pseudo-labels and enables SE models to directly learn from and adapt to real-recorded data, thereby improving their generalization capability. Furthermore, an SE model called FARNET is designed to fully utilize SuPseudo. Experiments on the MISP2023 corpus demonstrate the effectiveness of SuPseudo, and our system significantly outperforms the previous state-of-the-art. A demo of our method can be found at this https URL.
Authors: Ebenezer Tarubinga, Jenifer Kalafatovich, Seong-Whan Lee
Semi-supervised semantic segmentation (SSSS) faces persistent challenges in effectively leveraging unlabeled data, such as ineffective utilization of pseudo-labels, exacerbation of class imbalance biases, and neglect of prediction uncertainty. Current approaches often discard uncertain regions through strict thresholding favouring dominant classes. To address these limitations, we introduce a holistic framework that transforms uncertainty into a learning asset through four principal components: (1) fuzzy pseudo-labeling, which preserves soft class distributions from top-K predictions to enrich supervision; (2) uncertainty-aware dynamic weighting, which modulates pixel-wise contributions via entropy-based reliability scores; (3) adaptive class rebalancing, which dynamically adjusts losses to counteract long-tailed class distributions; and (4) lightweight contrastive regularization, which encourages compact and discriminative feature embeddings. Extensive experiments on benchmarks demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements in the segmentation of under-represented classes and ambiguous regions.
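Components (1) and (2) can be sketched compactly: keep a renormalized top-K slice of the teacher softmax as a soft pseudo-label, and down-weight pixels by normalized entropy. The PyTorch snippet below is one plausible instantiation under assumed tensor shapes, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def fuzzy_pseudo_labels(logits: torch.Tensor, k: int = 3):
    """logits: (B, C, H, W) teacher outputs -> soft top-k labels + entropy weights."""
    probs = F.softmax(logits, dim=1)
    topv, topi = probs.topk(k, dim=1)
    soft = torch.zeros_like(probs).scatter_(1, topi, topv)
    soft = soft / soft.sum(dim=1, keepdim=True)            # (1) renormalized top-k label
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    weight = 1.0 - ent / torch.log(torch.tensor(float(probs.shape[1])))
    return soft, weight                                     # (2) per-pixel reliability

teacher_logits = torch.randn(2, 21, 8, 8)                   # hypothetical B, C, H, W
student_logits = torch.randn(2, 21, 8, 8)
soft, w = fuzzy_pseudo_labels(teacher_logits)
ce = -(soft * F.log_softmax(student_logits, dim=1)).sum(dim=1)  # soft cross-entropy
loss = (w * ce).mean()                                      # uncertainty-aware weighting
print(loss)
```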
Authors: Nuwan Bandara, Thivya Kandappu, Archan Misra
Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments.
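The two post-processing ideas are simple to prototype. Below is a hedged sketch (numpy/scipy, synthetic gaze trace) of a median filter suppressing isolated blink-like spikes and a velocity-based smoothness score; the paper's motion-aware gating, flow alignment, and jitter metric are richer than this toy version.

```python
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 500)
gaze = np.sin(t) + 0.02 * rng.standard_normal(500)   # smooth gaze + sensor noise
gaze[[60, 61, 240]] += 1.5                            # blink-like spikes

smoothed = medfilt(gaze, kernel_size=5)               # suppresses isolated spikes

def jitter_score(x: np.ndarray) -> float:
    """Toy temporal-smoothness score: variance of the velocity signal."""
    return float(np.var(np.diff(x)))

print("jitter raw: %.4g | filtered: %.4g"
      % (jitter_score(gaze), jitter_score(smoothed)))
```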
Authors: Mehmet Ozgur Turkoglu, Selene Ledain, Helge Aasen
Conventional benchmarks for crop type classification from optical satellite time series typically assume access to labeled data from the same year and rely on fixed calendar-day sampling. This limits generalization across seasons, where crop phenology shifts due to interannual climate variability, and precludes real-time application when current-year labels are unavailable. Furthermore, uncertainty quantification is often neglected, making such approaches unreliable for crop monitoring applications. Inspired by ecophysiological principles of plant growth, we propose a simple, model-agnostic sampling strategy that leverages growing degree days (GDD), based on daily average temperature, to replace calendar time with thermal time. By uniformly subsampling time series in this biologically meaningful domain, the method emphasizes phenologically active growth stages while reducing temporal redundancy and noise. We evaluate the method on a multi-year Sentinel-2 dataset spanning all of Switzerland, training on one growing season and testing on other seasons. Compared to state-of-the-art baselines, our method delivers substantial gains in classification accuracy and, critically, produces more calibrated uncertainty estimates. Notably, our method excels in low-data regimes and enables significantly more accurate early-season classification. With only 10 percent of the training data, our method surpasses the state-of-the-art baseline in both predictive accuracy and uncertainty estimation, and by the end of June, it achieves performance similar to a baseline trained on the full season. These results demonstrate that leveraging temperature data not only improves predictive performance across seasons but also enhances the robustness and trustworthiness of crop-type mapping in real-world applications.
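The resampling step itself is only a few lines: accumulate growing degree days from daily mean temperature, then interpolate the reflectance time series onto a uniform GDD grid. The sketch below (numpy, synthetic temperatures and a hypothetical base temperature) illustrates the substitution of thermal time for calendar time.

```python
import numpy as np

rng = np.random.default_rng(0)
days = 180
t_mean = 12 + 10 * np.sin(np.linspace(0, np.pi, days)) + rng.standard_normal(days)
ndvi = np.clip(np.sin(np.linspace(0, np.pi, days))
               + 0.05 * rng.standard_normal(days), 0, 1)   # stand-in vegetation index

t_base = 0.0                                  # base temperature (crop-dependent assumption)
gdd = np.cumsum(np.maximum(t_mean - t_base, 0.0))   # thermal-time axis

# Uniformly subsample in GDD instead of calendar days: phenologically active,
# fast-growing periods (many degree days) receive proportionally more samples.
n_samples = 24
gdd_grid = np.linspace(gdd[0], gdd[-1], n_samples)
ndvi_thermal = np.interp(gdd_grid, gdd, ndvi)
print(ndvi_thermal.shape)                     # (24,) model-ready sequence
```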
Authors: Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
The emergence of GPT-4o-like large multimodal models (LMMs) has raised the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representation of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs LLM as the backbone and aligns the vision and speech to the text based on their relationships. For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimensional mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
Authors: Masahiko Ueda, Shoma Yagi, Genki Ichinose
An oligopoly is a market in which the price of a good is controlled by a few firms. Cournot introduced the simplest game-theoretic model of oligopoly, where profit-maximizing behavior of each firm results in market failure. Furthermore, when the Cournot oligopoly game is infinitely repeated, firms can tacitly collude to monopolize the market. Such tacit collusion is realized by the same mechanism as direct reciprocity in the repeated prisoner's dilemma game, where mutual cooperation can be realized whereas defection is favorable for both prisoners in the one-shot game. Recently, in the repeated prisoner's dilemma game, a class of strategies called zero-determinant strategies has attracted much attention in the context of direct reciprocity. Zero-determinant strategies are autocratic strategies which unilaterally control payoffs of players. There have been many attempts to find zero-determinant strategies in other games and to extend them so as to apply them to broader situations. In this paper, first, we show that zero-determinant strategies exist even in the repeated Cournot oligopoly game. Especially, we prove that an averagely unbeatable zero-determinant strategy exists, which is guaranteed to obtain the average payoff of the opponents. Second, we numerically show that the averagely unbeatable zero-determinant strategy can be used to promote collusion when it is used against an adaptively learning player, whereas it cannot promote collusion when it is used against two adaptively learning players. Our findings elucidate some negative impact of zero-determinant strategies in oligopoly markets.
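The defining property of a zero-determinant strategy, unilateral control of a payoff relation, can be verified numerically in the ordinary repeated prisoner's dilemma (the paper's Cournot setting has continuous actions, so this is only an analogy). The sketch below uses a classic Press-Dyson "equalizer" memory-one strategy, derived from p_tilde = beta*S_Y + gamma*1 with beta = -1/5 and gamma = 1/2, which pins the opponent's average payoff to -gamma/beta = 2.5 regardless of the opponent's strategy.

```python
import numpy as np

# States ordered (CC, CD, DC, DD) from player X's perspective.
PAYOFF_Y = np.array([3.0, 5.0, 0.0, 1.0])   # Y's payoffs (R, T, S, P) per state

# Equalizer ZD strategy for X: cooperation probability given the previous state.
p = np.array([0.9, 0.5, 0.5, 0.3])

def stationary_payoff_Y(q: np.ndarray) -> float:
    qy = q[[0, 2, 1, 3]]                    # Y sees CD/DC swapped
    # Row i -> probabilities of next state (CC, CD, DC, DD).
    T = np.column_stack([p * qy, p * (1 - qy),
                         (1 - p) * qy, (1 - p) * (1 - qy)])
    # Stationary distribution: left eigenvector of T for eigenvalue 1.
    w, v = np.linalg.eig(T.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi /= pi.sum()
    return float(pi @ PAYOFF_Y)

rng = np.random.default_rng(0)
for _ in range(3):
    q = rng.random(4)                        # arbitrary opponent strategy
    print("opponent", np.round(q, 2), "-> payoff",
          round(stationary_payoff_Y(q), 3))  # always ~2.5
```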
Authors: Sreeja Roy-Singh, Alan P. Li, Vinay Ravindra, Roderick Lammers, Marc Sanchez Net
Fully re-orientable small spacecraft are now supported by commercial technologies, allowing them to point their instruments in any direction and capture images, with short notice. When combined with improved onboard processing, and implemented on a constellation of inter-communicable satellites, this intelligent agility can significantly increase responsiveness to transient or evolving phenomena. We demonstrate a ground-based and onboard algorithmic framework that combines orbital mechanics, attitude control, inter-satellite communication, intelligent prediction and planning to schedule the time-varying, re-orientation of agile, small satellites in a constellation. Planner intelligence is improved by updating the predictive value of future space-time observations based on shared observations of evolving episodic precipitation and urban flood forecasts. Reliable inter-satellite communication within a fast, dynamic constellation topology is modeled in the physical, access control and network layers. We apply the framework on a representative 24-satellite constellation observing 5 global regions. Results show appropriately low latency in information exchange (on average within one-third of the available time for implicit consensus), enabling the onboard scheduler to observe ~7% more flood magnitude than a ground-based implementation. Both onboard and offline versions performed ~98% better than constellations without agility.
Authors: Seyed Mohsen Hosseini
Class imbalance and difficulty imbalance are two types of data imbalance that affect the performance of neural networks in medical segmentation tasks. Under class imbalance the loss is dominated by the majority classes, and under difficulty imbalance the loss is dominated by easy-to-classify pixels; both lead to ineffective training. Dice loss, which is based on a geometric metric, is very effective at addressing class imbalance compared to the cross-entropy (CE) loss, which is adopted directly from classification tasks. To address difficulty imbalance, the common approach is to employ a re-weighted CE loss or a modified Dice loss that focuses training on difficult-to-classify areas. Existing modification methods are computationally costly and have had limited success. In this study we propose a simple modification to the Dice loss with minimal computational cost. With a pixel-level modulating term, we take advantage of the effectiveness of Dice loss in handling class imbalance to also handle difficulty imbalance. Results on three commonly used medical segmentation tasks show that the proposed Pixel-wise Modulated Dice loss (PM Dice loss) outperforms other methods designed to tackle the difficulty imbalance problem.
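As a rough illustration of the idea (not the authors' exact formulation), a pixel-wise modulating term can up-weight hard pixels inside the Dice computation; the focal-style weight $(1-p_t)^\gamma$ below is an assumption made for the sketch.

```python
# Hedged sketch of a pixel-wise modulated Dice loss; the focal-style
# modulating weight (1 - p_t)**gamma is assumed, not the paper's exact term.
import torch

def pm_dice_loss(probs, target, gamma=2.0, eps=1e-6):
    """probs: (B, H, W) foreground probabilities; target: (B, H, W) binary mask."""
    p_t = probs * target + (1 - probs) * (1 - target)  # prob of the true class
    w = (1 - p_t) ** gamma                             # large for hard pixels
    inter = (w * probs * target).sum(dim=(1, 2))
    denom = (w * (probs + target)).sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

probs = torch.rand(2, 64, 64, requires_grad=True)
target = (torch.rand(2, 64, 64) > 0.5).float()
pm_dice_loss(probs, target).backward()
```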
Authors: Zakria Qadir, Muhammad Bilal, Guoqiang Liu, Xiaolong Xu
Unmanned aerial vehicles (UAVs) in disaster-prone environments play an important role in assisting rescue services and providing internet connectivity to the outside world. However, in such complex environments the selection of optimal UAV trajectories is of utmost importance. UAV trajectory optimization deals with finding the shortest path in the minimal possible time. In this paper, a cluster optimization scheme (COS) is proposed using the Henry gas optimization (HGO) metaheuristic algorithm to identify the shortest path with minimal transportation cost and algorithmic complexity. The mathematical model is designed for COS using the HGO algorithm and compared with state-of-the-art metaheuristic algorithms such as particle swarm optimization (PSO), grey wolf optimization (GWO), the cuckoo search algorithm (CSA), and the barnacles mating optimizer (BMO). To demonstrate the robustness of the proposed model, four different scenarios are evaluated: an ambient environment, a constricted environment, a tangled environment, and a complex environment. In all of these scenarios, the HGO algorithm outperforms the existing algorithms. In particular, in the ambient environment, the HGO algorithm achieves a 39.3% reduction in transportation cost and a 16.8% reduction in computational time compared to the PSO algorithm. Hence, the HGO algorithm can be used for autonomous trajectory optimization of UAVs in smart cities.
Authors: Saeed Razavikia, Carlo Fischione
Over-the-air computation (OAC) leverages the physical superposition property of wireless multiple access channels (MACs) to compute functions while communication occurs, enabling scalable and low-latency processing in distributed networks. While analog OAC methods suffer from noise sensitivity and hardware constraints, existing digital approaches are often limited by design complexity, which may hinder scalability and fail to fully exploit spectral efficiency. This two-part paper revisits and extends the ChannelComp framework, a general methodology for computing arbitrary finite-valued functions using digital modulation. In Part I, we develop a novel constellation design approach that is aware of the noise distribution and formulates the encoder design as a max-min optimization problem using noise-tailored distance metrics. Our design supports noise models including Gaussian, Laplace, and heavy-tailed distributions. We further demonstrate that, for heavy-tailed noise, the optimal ChannelComp setup coincides with the solution to the corresponding max-min criterion for heavy-tailed channel noise. Numerical experiments confirm that our noise-aware design achieves a substantially lower mean-square error than leading digital OAC methods over noisy MACs. In Part II, we consider a constellation design with a quantization-based sampling scheme to enhance modulation scalability and computational accuracy for large-scale digital OAC.
Authors: Haiyang Miao, Jianhua Zhang, Pan Tang, Heng Wang, Lei Tian, Guangyi Liu
With the increase of multiple-input multiple-output (MIMO) array size and carrier frequency, near-field MIMO communications will become crucial in 6G wireless networks. Due to the expansion of the MIMO near-field range, research on near-field MIMO capacity has attracted wide interest. In this paper, we focus on the theoretical analysis and empirical study of near-field MIMO capacity. First, the near-field channel model is characterized from the electromagnetic information perspective. Second, with a uniform planar array (UPA), the channel capacity based on the effective degrees of freedom (EDoF) is analyzed theoretically, and closed-form analytical expressions are derived in detail. Finally, based on numerical verification with a near-field channel measurement experiment in the 13 GHz band, we reveal that the channel capacity of UPA-type MIMO systems decreases continuously as the communication distance increases. We observe that the near-field channel capacity gain is relatively pronounced when large-scale MIMO is adopted at both the transmitting and receiving ends, but it may be limited in practical communication systems with a small antenna array at the receiving end. This work provides a reference for near-field communication systems.
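For orientation only: a commonly used EDoF-based capacity approximation (not necessarily the closed form derived in this paper) splits the total receive SNR $\rho$ evenly across the EDoF effective eigenchannels:

$$C \approx \mathrm{EDoF} \cdot \log_2\!\left(1 + \frac{\rho}{\mathrm{EDoF}}\right),$$

which already exhibits the qualitative behavior described above, since the EDoF of a UPA link shrinks as the communication distance grows.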
Authors: Chen Xu, Xianghao Yu, Fan Liu, Shi Jin
Integrated sensing and communications (ISAC) is one of the key enabling technologies in future sixth-generation (6G) networks. Current ISAC systems predominantly rely on deterministic pilot signals within the signal frame to accomplish sensing tasks. However, these pilot signals typically occupy only a small portion, e.g., 0.15% to 25%, of the time-frequency resources. To enhance the system utility, a promising solution is to repurpose the extensive random data payload signals for sensing tasks. In this paper, we analyze the ISAC performance of a multi-antenna system where both deterministic pilot and random data symbols are employed for sensing tasks. By capitalizing on random matrix theory (RMT), we first derive a semi-closed-form asymptotic expression of the ergodic linear minimum mean square error (ELMMSE). Then, we formulate an ISAC precoding optimization problem to minimize the ELMMSE, which is solved via a specifically tailored successive convex approximation (SCA) algorithm. To provide system insights, we further derive a closed-form expression for the asymptotic ELMMSE at high signal-to-noise ratios (SNRs). Our analysis reveals that, compared with conventional sensing implemented by deterministic signals, the sensing performance degradation induced by random signals is critically determined by the ratio of the transmit antenna size to the data symbol length. Based on this result, the ISAC precoding optimization problem at high SNRs is transformed into a convex optimization problem that can be efficiently solved. Simulation results validate the accuracy of the derived asymptotic expressions of ELMMSE and the performance of the proposed precoding schemes. Notably, by leveraging data payload signals for sensing tasks, the sensing error is reduced by up to 5.6 dB compared to conventional pilot-based sensing.
Authors: Rang Liu, Ming Li, Mehdi Zafari, Bjorn Ottersten, A. Lee Swindlehurst
Integrated sensing and communication (ISAC) has emerged as a key feature for sixth-generation (6G) networks, providing an opportunity to meet the dual demands of communication and sensing. Existing ISAC research primarily focuses on baseband optimization at individual access points, with limited attention to the roles of electromagnetic (EM) shaping and network-wide coordination. The intricate interdependencies between these domains remain insufficiently explored, leaving their full potential for enhancing ISAC performance largely untapped. To bridge this gap, we consider multi-domain ISAC optimization integrating EM shaping, baseband processing, and network cooperation strategies that facilitate efficient resource management and system-level design. We analyze the fundamental trade-offs between these domains and offer insights into domain-specific and cross-domain strategies contributing to ISAC performance and efficiency. We then conduct a case study demonstrating the effectiveness of joint multi-domain optimization. Finally, we discuss key challenges and future research directions to connect theoretical advancements and practical ISAC deployments. This work paves the way for intelligent and scalable ISAC architectures, providing critical insights for their seamless integration into next-generation wireless networks.
Authors: Xun Liu, Xiaobin Wu, Jiaqi He, Rajan Das Gupta
This study explores the effectiveness of predictive maintenance models and the optimization of intelligent Operation and Maintenance (O&M) systems in improving wind power generation efficiency. Through qualitative research, structured interviews were conducted with five wind farm engineers and maintenance managers, each with extensive experience in turbine operations. Using thematic analysis, the study revealed that while predictive maintenance models effectively reduce downtime by identifying major faults, they often struggle with detecting smaller, gradual failures. Key challenges identified include false positives, sensor malfunctions, and difficulties in integrating new models with older turbine systems. Advanced technologies such as digital twins, SCADA systems, and condition monitoring have significantly enhanced turbine maintenance practices. However, these technologies still require improvements, particularly in AI refinement and real-time data integration. The findings emphasize the need for continuous development to fully optimize wind turbine performance and support the broader adoption of renewable energy.
Authors: Francesco Conte (1), Fernando Mancilla-David (2), Amritansh Sagar (1), Chendan Li (3), Federico Silvestro (3), Samuele Grillo (2) ((1) Facoltà Dipartimentale di Ingegneria, Università Campus Bio-Medico di Roma, Rome, Italy, (2) Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy, (3) Dipartimento di Ingegneria Navale, Elettrica, Elettronica e delle Telecomunicazioni, Università degli Studi di Genova, Genoa, Italy)
This paper presents a detailed small-signal stability analysis of a modified version of the Cigré European high-voltage network, where one of the synchronous generators is replaced by a grid-following inverter-based resource (IBR). The analysis focuses on the influence of the parameters defining the grid-following IBR control scheme on the stability of the system. Given a set of potential grid configurations and the values of the IBR control parameters, stability is verified by direct eigenvalue analysis of a highly detailed linearized model of the overall Cigré network. Starting from this procedure, we propose an adaptive sampling method for training a support vector machine classifier able to estimate the probability of stability of the power system over a domain defined by candidate intervals of the considered parameters. The training of the classifier is refined to identify the boundaries of the parameters' stability regions with greater accuracy. The obtained results are then compared with those obtained by representing the grid with the classical Thévenin equivalent. Results suggest that, when the Thévenin equivalent is accurate, the predicted stability region is conservative yet contained within that of the full network.
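The adaptive-sampling idea can be pictured with a small scikit-learn sketch: fit a probabilistic SVM on (parameter vector, stable/unstable) samples, then concentrate new simulations where the predicted stability probability is near 0.5. The toy 2-D parameter space, labels, and thresholds below are illustrative, not the paper's setup.

```python
# Hedged sketch of probability-of-stability classification with boundary
# refinement; toy parameter space and labels, not the Cigre study itself.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))                 # candidate IBR control parameters
y = (X[:, 0] + 0.5 * X[:, 1] < 0.9).astype(int)      # stand-in "stable" labels

clf = SVC(kernel="rbf", probability=True).fit(X, y)

cand = rng.uniform(0, 1, size=(2000, 2))
p_stable = clf.predict_proba(cand)[:, 1]
ambiguous = cand[np.abs(p_stable - 0.5) < 0.05]      # label these by eigenvalue analysis next
```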
Authors: Haotian Yao, Vahid Hakimian, Mostafa Farrokhabadi, Hamidreza Zareipour
Since the beginning of this century, there has been a growing body of research and development supporting the participation of energy storage systems (ESS) in emission reduction mandates. However, regardless of these efforts and despite the need for an accelerated energy transition, we have yet to see a practical framework for operational carbon accounting and credit trading for energy storage systems. In this context, this paper proposes an emission performance credits (EPCs) framework that allows ESS, down to the prosumer level, to participate in the carbon market. Thus, a mechanism is proposed, for the first time, to calculate the grid's real-time marginal emission intensity (MEI). The MEI is then used to optimize the cumulative operational emissions of ESS through carbon-aware dispatch. Consequently, the framework tracks the operational emissions and converts them into EPCs, which are then sold to regulated entities under compliance programs. Simulation results support the potential of ESS, regardless of their size, to participate in the broader carbon mitigation objectives.
Authors: Orcun Karaca, Ioannis Tsoumas, Mario Schweizer, Ognjen Stanojev, Lennart Harnefors
This paper introduces a universal power synchronization controller for grid-side control of the wind turbine conversion systems in an offshore wind farm with a diode rectifier in the offshore substation of the HVDC link. The controller incorporates voltage-power droop controllers in the outer loop to enable the operation of this setup. To effectively handle the impact of large delays during black start and power ramp phases, virtual active and reactive power quantities are defined. These quantities are computed based on the current references prior to any modifications that might be needed to meet converter current and voltage limits or source constraints. Utilizing them in the outer loop ensures a balanced power sharing and a stable operation whenever the original (unmodified) current references are not realized. Case studies confirm the robustness of the proposed controller.
Authors: Muhammad Umer, Muhammad Ahmed Mohsin, Aamir Mahmood, Haejoon Jung, Haris Pervaiz, Mikael Gidlund, Syed Ali Hassan
This paper investigates the synergistic potential of reconfigurable intelligent surfaces (RIS) and non-orthogonal multiple access (NOMA) to enhance the energy efficiency and performance of next-generation wireless networks. We delve into the design of energy-efficient passive beamforming (PBF) strategies within RIS-assisted coordinated multi-point (CoMP)-NOMA networks. Two distinct RIS configurations, namely, enhancement-only PBF (EO) and enhancement & cancellation PBF (EC), are proposed and analyzed. Our findings demonstrate that RIS-assisted CoMP-NOMA networks offer significant efficiency gains compared to traditional CoMP-NOMA systems. Furthermore, we formulate a PBF design problem to optimize the RIS phase shifts for maximizing energy efficiency. Our results reveal that the optimal PBF design is contingent upon several factors, including the number of cooperating base stations (BSs), the number of RIS elements deployed, and the RIS configuration. This study underscores the potential of RIS-assisted CoMP-NOMA networks as a promising solution for achieving superior energy efficiency and overall performance in future wireless networks.
Authors: Qingqing Wu, Yanze Zhu, Qiaoyan Peng, Wanming Hao, Yanzhao Hou, Fengyuan Yang, Wencai Yan, Guoning Wang, Wen Chen, Chi Qiu
Intelligent reflecting surfaces (IRSs) have emerged as a cost-effective technology for terahertz (THz) communications by enabling programmable control of the wireless environment. This paper provides a comprehensive overview of IRS-aided THz communications, covering hardware designs, advanced signal processing techniques, and practical deployment strategies. It first examines key THz reconfigurable metasurface architectures, including electronic, optical, phase-change material, and micro-electromechanical systems (MEMS)-based implementations, highlighting their reconfiguration mechanisms and challenges. Then, fundamental effects in wideband THz systems, including the near field and beam squint, are analyzed, along with their impacts on system performance. The paper further explores conventional and beam-squint-assisted channel estimation methods, innovative beam management strategies, and deployment considerations across large- and small-scale scenarios. Practical experiments at 220 gigahertz (GHz) validate the effectiveness of IRSs in improving signal strength and communication reliability for both single-user and multi-user setups.
Authors: Jennifer Bondarchuk, Anthony Trezza, Donald J. Bucci Jr
Adaptive track initiation remains a crucial component of many modern multi-target tracking systems. For labeled random finite set multi-object filters, prior work has established how to construct a labeled multi-object birth density using measurements from multiple sensors. A naive construction of this adaptive birth density results in a number of newborn components that is exponential in the number of sensors. A truncation procedure was provided that leverages a Gibbs sampler to truncate the birth density, reducing the complexity to quadratic in the number of sensors. However, only limited discussion has been provided on additional algorithmic techniques that can be employed to substantially reduce the complexity in practical tracking applications. In this paper, we propose five efficiency enhancements for the labeled random finite set multi-sensor adaptive birth procedure. Simulation results demonstrate their computational benefits and show that they result in a negligible change to the multi-target tracking performance.
Authors: Fangzhou Lin, Zilin Dai, Rigved Sanku, Songlin Hou, Kazunori D Yamada, Haichong K. Zhang, Ziming Zhang
The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that takes only partial point clouds as input, view-free. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at this https URL.
Authors: Oluwaseyi Giwa, Muhammad Ahmed Mohsin, Muhammad Ali Jamshed
In this letter, we propose Quantum-Preconditioned Policy Gradient (QPPG), a natural gradient-based algorithm for link adaptation that whitens policy updates using the full inverse quantum Fisher information with Tikhonov regularization. QPPG bridges classical and quantum geometry, achieving stable learning even under noise. Evaluated on classical and quantum environments, including noisy single-qubit Gym tasks and Rayleigh-fading channels, QPPG converges 4 times faster than REINFORCE and sustains a 1 dB gain under uncertainty. It reaches a 90 percent return in one hundred episodes with high noise robustness, showcasing the advantages of full QFI-based preconditioning for scalable quantum reinforcement learning.
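The core preconditioning step is easy to state: solve a Tikhonov-regularized Fisher system instead of applying the raw gradient. The sketch below uses random stand-in matrices; in QPPG the matrix would be the quantum Fisher information of the parametrized policy.

```python
# Hedged sketch of Fisher-preconditioned (natural-gradient) updates with
# Tikhonov regularization; F and g are random stand-ins, not a real QFI.
import numpy as np

def preconditioned_step(grad, fisher, lam=1e-3, lr=0.05):
    d = grad.shape[0]
    nat_grad = np.linalg.solve(fisher + lam * np.eye(d), grad)  # (F + lam*I)^-1 g
    return lr * nat_grad

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
F = A @ A.T                      # stand-in PSD Fisher information matrix
g = rng.normal(size=4)           # stand-in policy gradient
delta_theta = preconditioned_step(g, F)
```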
Authors: Tahitoa Leygue (DIASI (CEA, LIST)), Astrid Sabourin (DIASI (CEA, LIST)), Christian Bolzmacher (DIASI (CEA, LIST)), Sylvain Bouchigny (DIASI (CEA, LIST)), Margarita Anastassova (DIASI (CEA, LIST)), Quoc-Cuong Pham (DIASI (CEA, LIST))
State-of-the-art transformer models for Speech Emotion Recognition (SER) rely on temporal feature aggregation, yet advanced pooling methods remain underexplored. We systematically benchmark pooling strategies, including Multi-Query Multi-Head Attentive Statistics Pooling, which achieves a 3.5 percentage point macro-F1 gain over average pooling. Attention analysis shows that 15 percent of frames capture 80 percent of emotion cues, revealing a localized pattern of emotional information. Analysis of high-attention frames reveals that non-linguistic vocalizations and hyperarticulated phonemes are disproportionately prioritized during pooling, mirroring human perceptual strategies. Our findings position attentive pooling as both a performant SER mechanism and a biologically plausible tool for explainable emotion localization. On the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, our approach obtained a macro-F1 score of 0.3649.
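For reference, a single-query, single-head attentive statistics pooling layer looks like the sketch below; the Multi-Query Multi-Head variant benchmarked above generalizes this with several learned queries and heads.

```python
# Hedged sketch: single-head attentive statistics pooling over frame features.
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x):                        # x: (B, T, D) frame features
        a = torch.softmax(self.score(x), dim=1)  # (B, T, 1) frame weights
        mean = (a * x).sum(dim=1)
        var = (a * (x - mean.unsqueeze(1)) ** 2).sum(dim=1)
        return torch.cat([mean, (var + 1e-6).sqrt()], dim=-1)  # (B, 2D) embedding

pool = AttentiveStatsPool(256)
emb = pool(torch.randn(8, 120, 256))             # feeds the emotion classifier head
```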
Authors: Siyi Xie, Hanxin Zhu, Tianyu He, Xin Li, Zhibo Chen
Recent advancements in 4D generation have demonstrated its remarkable capability in synthesizing photorealistic renderings of dynamic 3D scenes. However, despite achieving impressive visual performance, almost all existing methods overlook the generation of spatial audio aligned with the corresponding 4D scenes, posing a significant limitation to truly immersive audiovisual experiences. To mitigate this issue, we propose Sonic4D, a novel framework that enables spatial audio generation for immersive exploration of 4D scenes. Specifically, our method is composed of three stages: 1) To capture both the dynamic visual content and raw auditory information from a monocular video, we first employ pre-trained expert models to generate the 4D scene and its corresponding monaural audio. 2) Subsequently, to transform the monaural audio into spatial audio, we localize and track the sound sources within the 4D scene, where their 3D spatial coordinates at different timestamps are estimated via a pixel-level visual grounding strategy. 3) Based on the estimated sound source locations, we further synthesize plausible spatial audio that varies across different viewpoints and timestamps using physics-based simulation. Extensive experiments have demonstrated that our proposed method generates realistic spatial audio consistent with the synthesized 4D scene in a training-free manner, significantly enhancing the immersive experience for users. Generated audio and video examples are available at this https URL.
Authors: Baxi Chong, Juntao He, Daniel Irvine, Tianyu Wang, Esteban Flores, Daniel Soto, Jianfeng Lin, Zhaochen Xu, Vincent R Nienhusser, Grigoriy Blekherman, Daniel I. Goldman
Modern two- and four-legged robots exhibit impressive mobility on complex terrain, largely attributed to advancements in learning algorithms. However, these systems often rely on high-bandwidth sensing and onboard computation to perceive and respond to terrain uncertainties. Further, current locomotion strategies typically require extensive robot-specific training, limiting their generalizability across platforms. Building on our prior research connecting robot-environment interaction and communication theory, we develop a new paradigm to construct robust and simply controlled multi-legged elongate robots (MERs) capable of operating effectively in cluttered, unstructured environments. In this framework, each leg-ground contact is thought of as a basic active contact (bac), akin to a bit in signal transmission. Reliable locomotion can be achieved in open loop on "noisy" landscapes via sufficient redundancy in bacs. In such situations, robustness is achieved through passive mechanical responses, processes we describe as displaying mechanical intelligence (MI), analogous to forward error correction (FEC) in signal transmission. To augment MI, we develop feedback control schemes, which we refer to as computational intelligence (CI), analogous to automatic repeat request (ARQ) in signal transmission. Integrating these analogies between locomotion and communication theory allows analysis, design, and prediction of embodied intelligence control schemes (integrating MI and CI) in MERs, showing effective and reliable performance (approximately half a body length per cycle) on complex landscapes with terrain "noise" over twice the robot's height. Our work provides a foundation for the systematic development of MER control, paving the way for terrain-agnostic, agile, and resilient robotic systems capable of operating in extreme environments.
Authors: Xiaodan Shao, Limei Hu, Yulong Sun, Xing Li, Yixiao Zhang, Jingze Ding, Xiaoming Shi, Feng Chen, Derrick Wing Kwan Ng, Robert Schober
Six-dimensional movable antenna (6DMA) has been identified as a new disruptive technology for future wireless systems to support a large number of users with only a few antennas. However, the intricate relationships between the signal carrier wavelength and the transceiver region size lead to inaccuracies in the traditional far-field 6DMA channel model, causing discrepancies between the model predictions and the hybrid-field channel characteristics in practical 6DMA systems, where users might be in the far-field region relative to the antennas on the same 6DMA surface, while simultaneously being in the near-field region relative to different 6DMA surfaces. Moreover, due to the high-dimensional channel and the coupled position and rotation constraints, the estimation of the 6DMA channel and the joint design of the 6DMA positions and rotations and the transmit beamforming at the base station (BS) incur extremely high computational complexity. To address these issues, we propose an efficient hybrid-field generalized 6DMA channel model, which accounts for planar-wave propagation within individual 6DMA surfaces and spherical-wave propagation among different 6DMA surfaces. Furthermore, by leveraging directional sparsity, we propose a low-overhead channel estimation algorithm that efficiently constructs a complete channel map for all potential antenna position-rotation pairs while limiting the training overhead incurred by antenna movement. In addition, we propose a low-complexity design leveraging deep reinforcement learning (DRL), which facilitates the joint design of the 6DMA positions, rotations, and beamforming in a unified manner. Numerical results demonstrate that the proposed hybrid-field channel model and channel estimation algorithm outperform existing approaches and that the DRL-enhanced 6DMA system significantly surpasses flexible antenna systems.
Authors: Pham Khai Nguyen Do, Bao Nguyen Tran, Nam Nguyen, Duc Dung Nguyen
Recent advances in Novel View Synthesis (NVS) and 3D generation have significantly improved editing tasks, with a primary emphasis on maintaining cross-view consistency throughout the generative process. Contemporary methods typically address this challenge using a dual-strategy framework: performing consistent 2D inpainting across all views guided by embedded priors either explicitly in pixel space or implicitly in latent space; and conducting 3D reconstruction with additional consistency guidance. Previous strategies, in particular, often require an initial 3D reconstruction phase to establish geometric structure, introducing considerable computational overhead. Even with the added cost, the resulting reconstruction quality often remains suboptimal. In this paper, we present VEIGAR, a computationally efficient framework that outperforms existing methods without relying on an initial reconstruction phase. VEIGAR leverages a lightweight foundation model to reliably align priors explicitly in the pixel space. In addition, we introduce a novel supervision strategy based on scale-invariant depth loss, which removes the need for traditional scale-and-shift operations in monocular depth regularization. Through extensive experimentation, VEIGAR establishes a new state-of-the-art benchmark in reconstruction quality and cross-view consistency, while achieving a threefold reduction in training time compared to the fastest existing method, highlighting its superior balance of efficiency and effectiveness.
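One widely used form of scale-invariant depth loss (in the spirit of Eigen et al.) is shown below for reference; VEIGAR's exact formulation may differ, but the key property is invariance to a global depth scale, which is what removes the scale-and-shift step in monocular depth regularization.

```python
# Hedged sketch of a scale-invariant log-depth loss; lam = 1 gives exact
# invariance to a global scale factor on the predicted depth.
import torch

def scale_invariant_loss(pred, gt, lam=0.85, eps=1e-8):
    d = torch.log(pred + eps) - torch.log(gt + eps)        # per-pixel log error
    n = d.numel()
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / n ** 2

pred = torch.rand(1, 240, 320, requires_grad=True) + 0.1   # monocular depth estimate
gt = torch.rand(1, 240, 320) + 0.1                         # reference depth
scale_invariant_loss(pred, gt).backward()
```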
Authors: Hossein Maghsoumi, Yaser Fallah
Small-scale autonomous vehicle platforms provide a cost-effective environment for developing and testing advanced driving systems. However, specific configurations within this scale are underrepresented, limiting full awareness of their potential. This paper focuses on a one-sixth-scale setup, offering a high-level overview of its design, hardware and software integration, and typical challenges encountered during development. We discuss methods for addressing mechanical and electronic issues common to this scale and propose guidelines for improving reliability and performance. By sharing these insights, we aim to expand the utility of small-scale vehicles for testing autonomous driving algorithms and to encourage further research in this domain.
Authors: Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen, Lei Xing, James Zou
Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
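The mechanism reduces to a one-line intervention on hidden states: re-add a precomputed "deeper reasoning" direction with a tunable coefficient. The sketch below assumes the steering vector has already been extracted (e.g., as an activation difference between prompted and unprompted runs); all names and shapes are illustrative.

```python
# Hedged sketch: scale a latent steering vector to modulate reasoning depth.
# The steering vector v is assumed precomputed; alpha is the tunable intensity.
import torch

def apply_fractional_steering(hidden, v, alpha):
    """hidden: (B, T, D) layer activations; v: (D,) steering direction."""
    return hidden + alpha * v

h = torch.randn(1, 16, 768)
v = torch.randn(768)                                    # stand-in steering vector
h_deep = apply_fractional_steering(h, v, alpha=1.5)     # more reasoning
h_shallow = apply_fractional_steering(h, v, alpha=0.3)  # less reasoning
```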
Authors: Israel Charles, Hossein Maghsoumi, Yaser Fallah
The RoboRacer (F1TENTH) platform has emerged as a leading testbed for advancing autonomous driving research, offering a scalable, cost-effective, and community-driven environment for experimentation. This paper presents a comprehensive survey of the platform, analyzing its modular hardware and software architecture, diverse research applications, and role in autonomous systems education. We examine critical aspects such as bridging the simulation-to-reality (Sim2Real) gap, integration with simulation environments, and the availability of standardized datasets and benchmarks. Furthermore, the survey highlights advancements in perception, planning, and control algorithms, as well as insights from global competitions and collaborative research efforts. By consolidating these contributions, this study positions RoboRacer as a versatile framework for accelerating innovation and bridging the gap between theoretical research and real-world deployment. The findings underscore the platform's significance in driving forward developments in autonomous racing and robotics.
Authors: Hang Yang, Yusheng Hu, Yong Liu, Cong (Callie) Hao
Accurate graph similarity is critical for knowledge transfer in VLSI design, enabling the reuse of prior solutions to reduce engineering effort and turnaround time. We propose Pieceformer, a scalable, self-supervised similarity assessment framework, equipped with a hybrid message-passing and graph transformer encoder. To address transformer scalability, we incorporate a linear transformer backbone and introduce a partitioned training pipeline for efficient memory and parallelism management. Evaluations on synthetic and real-world CircuitNet datasets show that Pieceformer reduces mean absolute error (MAE) by 24.9% over the baseline and is the only method to correctly cluster all real-world design groups. We further demonstrate the practical usage of our model through a case study on a partitioning task, achieving up to 89% runtime reduction. These results validate the framework's effectiveness for scalable, unbiased design reuse in modern VLSI systems.
Authors: Zifei Xu, Sayeh Sharify, Hesham Mostafa, Tristan Webb, Wanzin Yazar, Xin Wang
Transformer-based neural speech processing has achieved state-of-the-art performance. Since speech audio signals are known to be highly compressible, here we seek to accelerate neural speech transcription by time-domain signal sparsification early in the neural encoding stage, taking advantage of the interpretability of the self-attention mechanism in transformer audio encoders. With the Whisper family of models, we perform a systematic architecture search over the joint space of sparsification stage (a certain encoder layer) and compression ratio (sparsity). We found that the best resulting solutions under 1% accuracy degradation choose to sparsify the hidden state to 40-60% sparsity at an early encoding stage, and thereby achieve up to 1.6x runtime acceleration in English speech transcription tasks on Nvidia GPUs without any fine-tuning.
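Mechanically, the sparsification amounts to ranking hidden-state frames by an attention-derived saliency score at a chosen encoder layer and keeping only the top fraction; the sketch below uses random stand-ins for the scores and a 50% keep ratio rather than the searched configuration.

```python
# Hedged sketch: drop the least-attended frames of an encoder hidden state.
# Saliency scores and the 50% keep ratio are illustrative placeholders.
import torch

def sparsify_frames(hidden, saliency, keep_ratio=0.5):
    """hidden: (B, T, D); saliency: (B, T) per-frame importance."""
    k = max(1, int(hidden.shape[1] * keep_ratio))
    idx = saliency.topk(k, dim=1).indices.sort(dim=1).values   # keep temporal order
    return torch.gather(hidden, 1, idx.unsqueeze(-1).expand(-1, -1, hidden.shape[-1]))

h = torch.randn(2, 1500, 512)      # Whisper-like encoder states
s = torch.rand(2, 1500)            # e.g., column sums of self-attention maps
h_sparse = sparsify_frames(h, s)   # (2, 750, 512): cheaper downstream layers
```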
Authors: Liangyan Li, Yimo Ning, Kevin Le, Wei Dong, Yunzhe Li, Jun Chen, Xiaohong Liu
This paper introduces a novel framework for image and video demoiréing by integrating Maximum A Posteriori (MAP) estimation with advanced deep learning techniques. Demoiréing addresses inherently nonlinear degradation processes, which pose significant challenges for existing methods. Traditional supervised learning approaches either fail to remove moiré patterns completely or produce overly smooth results. This stems from constrained model capacity and scarce training data, which inadequately represent the clean image distribution and hinder accurate reconstruction of ground-truth images. While generative models excel in image restoration for linear degradations, they struggle with nonlinear cases such as demoiréing and often introduce artifacts. To address these limitations, we propose a hybrid MAP-based framework that integrates two complementary components. The first is a supervised learning model enhanced with efficient linear attention Test-Time Training (TTT) modules, which directly learn nonlinear mappings for RAW-to-sRGB demoiréing. The second is a Truncated Flow Matching Prior (TFMP) that further refines the outputs by aligning them with the clean image distribution, effectively restoring high-frequency details and suppressing artifacts. These two components combine the computational efficiency of linear attention with the refinement abilities of generative models, resulting in improved restoration performance.
Authors: Jinbo Wen, Cheng Su, Jiawen Kang, Jiangtian Nie, Yang Zhang, Jianhang Tang, Dusit Niyato, Chau Yuen
Low-Altitude Economy Networks (LAENets) are emerging as a promising paradigm to support various low-altitude services through integrated air-ground infrastructure. To satisfy low-latency and high-computation demands, the integration of Unmanned Aerial Vehicles (UAVs) with Mobile Edge Computing (MEC) systems plays a vital role, which offloads computing tasks from terminal devices to nearby UAVs, enabling flexible and resilient service provisions for ground users. To promote the development of LAENets, it is essential to achieve low-carbon multi-UAV-assisted MEC networks. However, several challenges hinder this implementation, including the complexity of multi-dimensional UAV modeling and the difficulty of multi-objective coupled optimization. To this end, this paper proposes a novel Retrieval Augmented Generation (RAG)-based Large Language Model (LLM) agent framework for model formulation. Specifically, we develop HybridRAG by combining KeywordRAG, VectorRAG, and GraphRAG, empowering LLM agents to efficiently retrieve structural information from expert databases and generate more accurate optimization problems compared with traditional RAG-based LLM agents. After customizing carbon emission optimization problems for multi-UAV-assisted MEC networks, we propose a Double Regularization Diffusion-enhanced Soft Actor-Critic (R\textsuperscript{2}DSAC) algorithm to solve the formulated multi-objective optimization problem. The R\textsuperscript{2}DSAC algorithm incorporates diffusion entropy regularization and action entropy regularization to improve the performance of the diffusion policy. Furthermore, we dynamically mask unimportant neurons in the actor network to reduce the carbon emissions associated with model training. Simulation results demonstrate the effectiveness and reliability of the proposed HybridRAG-based LLM agent framework and the R\textsuperscript{2}DSAC algorithm.
Authors: Connor Ding, Abhiram Rao Gorle, Jiwon Jeong, Naomi Sagan, Tsachy Weissman
In this work, we explore the interplay between information and computation in non-linear transform-based compression for broad classes of modern information-processing tasks. We first investigate two emerging nonlinear data transformation frameworks for image compression: Implicit Neural Representations (INRs) and 2D Gaussian Splatting (GS). We analyze their representational properties, behavior under lossy compression, and convergence dynamics. Our results highlight key trade-offs between INR's compact, resolution-flexible neural field representations and GS's highly parallelizable, spatially interpretable fitting, providing insights for future hybrid and compression-aware frameworks. Next, we introduce the textual transform that enables efficient compression at ultra-low bitrate regimes and simultaneously enhances human perceptual satisfaction. When combined with the concept of denoising via lossy compression, the textual transform becomes a powerful tool for denoising tasks. Finally, we present a Lempel-Ziv (LZ78) "transform", a universal method that, when applied to any member of a broad compressor family, produces new compressors that retain the asymptotic universality guarantees of the LZ78 algorithm. Collectively, these three transforms illuminate the fundamental trade-offs between coding efficiency and computational cost. We discuss how these insights extend beyond compression to tasks such as classification, denoising, and generative AI, suggesting new pathways for using non-linear transformations to balance resource constraints and performance.
Authors: Markus Frohmann, Gabriel Meseguer-Brocal, Markus Schedl, Elena V. Epure
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at this https URL.
Authors: Jianzhu Huai, Yuxin Shao, Yujia Zhang, Alper Yilmaz
The rapid advancement of the metaverse, digital twins, and robotics underscores the demand for low-cost, portable mapping systems for reality capture. Current mobile solutions, such as the Leica BLK2Go and lidar-equipped smartphones, either come at a high cost or are limited in range and accuracy. Leveraging the proliferation and technological evolution of mobile devices alongside recent advancements in lidar technology, we introduce a novel, low-cost, portable mobile mapping system. Our system integrates a lidar unit, an Android smartphone, and an RTK-GNSS stick. Running on the Android platform, it features lidar-inertial odometry built with the NDK, and logs data from the lidar, wide-angle camera, IMU, and GNSS. With a total bill of materials (BOM) cost under 2,000 USD and a weight of about 1 kilogram, the system achieves a good balance between affordability and portability. We detail the system design, multisensor calibration, synchronization, and evaluate its performance for tracking and mapping. To further contribute to the community, the system's design and software are made open source at: this https URL
Authors: William Sharpless, Dylan Hirsch, Sander Tonkens, Nikhil Shinde, Sylvia Herbert
Hard constraints in reinforcement learning (RL), whether imposed via the reward function or the model architecture, often degrade policy performance. Lagrangian methods offer a way to blend objectives with constraints, but often require intricate reward engineering and parameter tuning. In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction. Namely, we address: (1) the Reach-Always-Avoid problem - of achieving distinct reward and penalty thresholds - and (2) the Reach-Reach problem - of achieving thresholds of two distinct rewards. In contrast with temporal logic approaches, which typically involve representing an automaton, we derive explicit, tractable Bellman forms in this context by decomposing our problem into reach, avoid, and reach-avoid problems, so as to leverage the aforementioned recent advances. From a mathematical perspective, the Reach-Always-Avoid and Reach-Reach problems are complementary and fundamentally different from standard sum-of-rewards problems and temporal logic problems, providing a new perspective on constrained decision-making. We leverage our analysis to propose a variation of Proximal Policy Optimization (DO-HJ-PPO), which solves these problems. Across a range of tasks for safe-arrival and multi-target achievement, we demonstrate that DO-HJ-PPO produces qualitatively distinct behaviors from previous approaches and out-competes a number of baselines in various metrics.
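For context, the classical discrete-time reach-avoid value function that such decompositions build on satisfies a min-max fixed point; in one standard (undiscounted) form, with reach reward $\ell$, avoid signal $g$ (safe while $g(s) > 0$), and dynamics $s' = f(s,a)$:

$$V(s) = \min\Big\{\, g(s),\; \max\big\{\, \ell(s),\; \max_{a} V\big(f(s,a)\big) \,\big\} \Big\}.$$

The paper's Reach-Always-Avoid and Reach-Reach values are obtained by decomposing into reach, avoid, and reach-avoid problems of this type.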
Authors: Zijing Zhao, Kai Wang, Hao Huang, Ying Hu, Liang He, Jichen Yang
To explore the potential advantages of utilizing spatial cues from images for generating stereo singing voices with room reverberation, we introduce VS-Singer, a vision-guided model designed to produce stereo singing voices with room reverberation from scene images. VS-Singer comprises three modules: firstly, a modal interaction network integrates spatial features into text encoding to create a linguistic representation enriched with spatial information. Secondly, the decoder employs a consistency Schrödinger bridge to facilitate one-step sample generation. Moreover, we utilize the SFE module to improve the consistency of audio-visual matching. To our knowledge, this study is the first to combine stereo singing voice synthesis with visual acoustic matching within a unified framework. Experimental results demonstrate that VS-Singer can effectively generate stereo singing voices that align with the scene perspective in a single step.
Authors: Zhen Qin, Michael B. Wakin, Zhihui Zhu
Tensor decompositions, which represent an $N$-order tensor using approximately $N$ factors of much smaller dimensions, can significantly reduce the number of parameters. This is particularly beneficial for high-order tensors, as the number of entries in a tensor grows exponentially with the order. Consequently, they are widely used in signal recovery and data analysis across domains such as signal processing, machine learning, and quantum physics. A computationally and memory-efficient approach to these problems is to optimize directly over the factors using local search algorithms such as gradient descent, a strategy known as the factorization approach in matrix and tensor optimization. However, the resulting optimization problems are highly nonconvex due to the multiplicative interactions between factors, posing significant challenges for convergence analysis and recovery guarantees. In this paper, we present a unified framework for the factorization approach to solving various tensor decomposition problems. Specifically, by leveraging the canonical form of tensor decompositions--where most factors are constrained to be orthonormal to mitigate scaling ambiguity--we apply Riemannian gradient descent (RGD) to optimize these orthonormal factors on the Stiefel manifold. Under a mild condition on the loss function, we establish a Riemannian regularity condition for the factorized objective and prove that RGD converges to the ground-truth tensor at a linear rate when properly initialized. Notably, both the initialization requirement and the convergence rate scale polynomially rather than exponentially with $N$, improving upon existing results for Tucker and tensor-train format tensors.
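The optimization primitive here is standard and compact: project the Euclidean gradient onto the tangent space of the Stiefel manifold, take a step, and retract back to orthonormality. A generic sketch with a QR retraction follows; the Euclidean gradient is a random stand-in, whereas the paper applies this to the orthonormal factors of canonical-form tensor decompositions.

```python
# Hedged sketch: Riemannian gradient descent on the Stiefel manifold St(n, r)
# with a QR retraction; the Euclidean gradient is a random stand-in.
import numpy as np

def project_tangent(U, G):
    """Project Euclidean gradient G to the tangent space at U (U^T U = I)."""
    sym = (U.T @ G + G.T @ U) / 2
    return G - U @ sym

def retract_qr(M):
    Q, R = np.linalg.qr(M)
    return Q * np.sign(np.diag(R))        # fix the sign ambiguity of QR

def rgd_step(U, euclid_grad, lr=0.1):
    return retract_qr(U - lr * project_tangent(U, euclid_grad))

rng = np.random.default_rng(0)
U = retract_qr(rng.normal(size=(8, 3)))   # random point on St(8, 3)
G = rng.normal(size=(8, 3))               # stand-in Euclidean gradient
U_next = rgd_step(U, G)                   # remains orthonormal: U_next.T @ U_next ~ I
```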
Authors: Shrinivas Chimmalgi, Laurent Schmalen, Vahid Aref
Probabilistic constellation shaping enables easy rate adaptation and has been proven to reduce the gap to the Shannon capacity. Constellation point probabilities are optimized to maximize either the mutual information or the bit-wise mutual information. The optimization problem is, however, challenging even for simple channel models. While autoencoder-based machine learning has been applied successfully to solve this problem [1], it requires manual computation of additional terms for the gradient, which is an error-prone task. In this work, we present novel loss functions for autoencoder-based learning of probabilistic constellation shaping for coded modulation systems using automatic differentiation and importance sampling. We show analytically that our proposed approach also uses exact gradients of the constellation point probabilities for the optimization. In simulations, our results closely match the results from [1] for the additive white Gaussian noise channel and a simplified model of the intensity-modulation direct-detection channel.
Authors: Shoutrik Das, Nishant Singh, Arjun Gangwar, S Umesh
Dysarthria is a neurological disorder that significantly impairs speech intelligibility, often rendering affected individuals unable to communicate effectively. This necessitates the development of robust dysarthric-to-regular speech conversion techniques. In this work, we investigate the utility and limitations of self-supervised learning (SSL) features and their quantized representations as an alternative to mel-spectrograms for speech generation. Additionally, we explore methods to mitigate speaker variability by generating clean speech in a single-speaker voice using features extracted from WavLM. To this end, we propose a fully non-autoregressive approach that leverages Conditional Flow Matching (CFM) with Diffusion Transformers to learn a direct mapping from dysarthric to clean speech. Our findings highlight the effectiveness of discrete acoustic units in improving intelligibility while achieving faster convergence compared to traditional mel-spectrogram-based approaches.
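As background, the generic conditional flow matching objective regresses a velocity field along a straight interpolation path between source and target samples; the minimal sketch below abstracts away the paper's Diffusion Transformer backbone, WavLM features, and dysarthric-to-clean conditioning.

```python
# Hedged sketch of a conditional flow matching loss on feature vectors.
# The tiny MLP stands in for the paper's Diffusion Transformer.
import torch
import torch.nn as nn

v_net = nn.Sequential(nn.Linear(80 + 1, 256), nn.SiLU(), nn.Linear(256, 80))

def cfm_loss(x0, x1):
    """x0: source features, x1: target clean features; both (B, 80)."""
    t = torch.rand(x0.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1                      # point on the straight path
    target_v = x1 - x0                               # constant path velocity
    pred_v = v_net(torch.cat([x_t, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

cfm_loss(torch.randn(16, 80), torch.randn(16, 80)).backward()
```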
Authors: Jiang Wang, Runwu Shi, Benjamin Yen, He Kong, Kazuhiro Nakadai
Accurately estimating sound source positions is crucial for robot audition. However, existing sound source localization methods typically rely on a microphone array with at least two spatially preconfigured microphones. This requirement hinders the applicability of microphone-based robot audition systems and technologies. To alleviate these challenges, we propose an online sound source localization method that uses a single microphone mounted on a mobile robot in reverberant environments. Specifically, we develop a lightweight neural network model with only 43k parameters to perform real-time distance estimation by extracting temporal information from reverberant signals. The estimated distances are then processed using an extended Kalman filter to achieve online sound source localization. To the best of our knowledge, this is the first work to achieve online sound source localization using a single microphone on a moving robot. Extensive experiments demonstrate the effectiveness and merits of our approach. To benefit the broader research community, we have open-sourced our code at this https URL.
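The estimator's second stage is a textbook EKF with a range-only measurement model $h(x) = \lVert x - m \rVert$ for microphone position $m$; a 2-D sketch with illustrative noise values follows, where the network-based distance estimator is assumed to have produced the measurements $z$.

```python
# Hedged sketch: EKF update for a static 2-D source from distance-only
# measurements taken at successive microphone positions; numbers illustrative.
import numpy as np

def ekf_range_update(x, P, z, mic, R=0.05):
    """x: (2,) source estimate; P: (2,2) covariance; z: measured distance."""
    diff = x - mic
    d = np.linalg.norm(diff)
    H = (diff / d).reshape(1, 2)          # Jacobian of h(x) = ||x - mic||
    S = H @ P @ H.T + R
    K = P @ H.T / S                       # Kalman gain, (2, 1)
    x = x + (K * (z - d)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.array([0.0, 0.0]), np.eye(2) * 4.0
for mic, z in [(np.array([1.0, 0.0]), 2.1), (np.array([0.0, 1.5]), 2.4)]:
    x, P = ekf_range_update(x, P, z, mic)  # robot moves, estimate tightens
```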
Authors: Jiale Liu, Dandan Peng, Huan Wang, Chenyu Liu, Yan-Fu Li, Min Xie
Aerospace engines, as critical components in the aviation and aerospace industries, require continuous and accurate fault diagnosis to ensure operational safety and prevent catastrophic failures. While deep learning techniques have been extensively studied in this context, they output logits or confidence scores, necessitating post-processing to derive actionable insights. Furthermore, the potential of large-scale audio models in this domain remains largely untapped. To address these limitations, this paper proposes AeroGPT, a novel framework that transfers knowledge from the general audio domain to aero-engine bearing fault diagnosis. AeroGPT is a framework based on a large-scale audio model that incorporates Vibration Signal Alignment (VSA) to adapt general audio knowledge to domain-specific vibration patterns, and combines Generative Fault Classification (GFC) to directly output interpretable fault labels. This approach eliminates the need for post-processing of fault labels and supports interactive, interpretable, and actionable fault diagnosis, thereby greatly enhancing industrial applicability. Through comprehensive experimental validation on two aero-engine bearing datasets, AeroGPT achieved exceptional performance, with 98.94% accuracy on the DIRG dataset and perfect 100% classification on the HIT bearing dataset, surpassing traditional deep learning approaches. Additional qualitative analysis validates the effectiveness of our approach and highlights the potential of large-scale models to revolutionize fault diagnosis.
Authors: Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan D, Santosh Kesiraju, Anil Kumar Vuppala
The scarcity of high-quality annotated data presents a significant challenge in developing effective end-to-end speech-to-text translation (ST) systems, particularly for low-resource languages. This paper explores the hypothesis that weakly labeled data can be used to build ST models for low-resource language pairs. We constructed speech-to-text translation datasets with the help of bitext mining using state-of-the-art sentence encoders. We mined the multilingual Shrutilipi corpus to build Shrutilipi-anuvaad, a dataset comprising ST data for language pairs Bengali-Hindi, Malayalam-Hindi, Odia-Hindi, and Telugu-Hindi. We created multiple versions of training data with varying degrees of quality and quantity to investigate the effect of quality versus quantity of weakly labeled data on ST model performance. Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines such as SONAR and SeamlessM4T.
Authors: Zhaoyi Wang, Jemil Avers Butt, Shengyu Huang, Tomislav Medic, Andreas Wieser
Landslide monitoring is essential for understanding geohazards and mitigating associated risks. However, existing point cloud-based methods typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. We construct patch-level matches using both 3D geometry and 2D image features. These matches are refined via geometric consistency checks, followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that our method produces 3D displacement estimates with high spatial coverage (79% and 97%) and high accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references. These values are below the average scan resolutions (0.08 m and 0.30 m). Our method outperforms the state-of-the-art method F2S3 in spatial coverage while maintaining comparable accuracy. Our approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. Our example data and source code are publicly available at this https URL.
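The per-match rigid transformation step is typically solved in closed form with the SVD-based Kabsch/Procrustes method; a generic sketch follows (the patch construction and geometric consistency checks described above are the paper's contribution and are not shown).

```python
# Hedged sketch: closed-form rigid transform (R, t) between matched points,
# the standard per-match estimation step; synthetic data for illustration.
import numpy as np

def rigid_transform(src, dst):
    """Least-squares R, t such that dst ~ R @ src + t; src, dst: (N, 3)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflections
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))                 # patch points in the earlier epoch
dst = src + np.array([0.10, -0.20, 0.05])      # same patch, displaced
R, t = rigid_transform(src, dst)               # here t recovers the 3D displacement
```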
Authors: Hao-Chien Lu, Jhen-Ke Lin, Hong-Yun Lin, Chung-Chun Wang, Berlin Chen
Current automated speaking assessment (ASA) systems for use in multi-aspect evaluations often fail to make full use of content relevance, overlooking image or exemplar cues, and employ superficial grammar analysis that lacks detailed error types. This paper ameliorates these deficiencies by introducing two novel enhancements to construct a hybrid scoring model. First, a multifaceted relevance module integrates question and the associated image content, exemplar, and spoken response of an L2 speaker for a comprehensive assessment of content relevance. Second, fine-grained grammar error features are derived using advanced grammar error correction (GEC) and detailed annotation to identify specific error categories. Experiments and ablation studies demonstrate that these components significantly improve the evaluation of content relevance, language use, and overall ASA performance, highlighting the benefits of using richer, more nuanced feature sets for holistic speaking assessment.
Authors: Xu Zhao, Chen Zhao, Xiantao Hu, Hongliang Zhang, Ying Tai, Jian Yang
Recent advancements in multi-scale architectures have demonstrated exceptional performance in image denoising tasks. However, existing architectures mainly depend on a fixed single-input single-output U-Net architecture, ignoring pixel-level multi-scale representations. In addition, previous methods treat the frequency domain uniformly, ignoring the different characteristics of high-frequency and low-frequency noise. In this paper, we propose a novel multi-scale adaptive dual-domain network (MADNet) for image denoising. We use image pyramid inputs to restore noise-free results from low-resolution images. To realize the interaction of high-frequency and low-frequency information, we design an adaptive spatial-frequency learning unit (ASFU), where a learnable mask is used to separate the information into high-frequency and low-frequency components. In the skip connections, we design a global feature fusion block to enhance the features at different scales. Extensive experiments on both synthetic and real noisy image datasets verify the effectiveness of MADNet compared with current state-of-the-art denoising approaches.
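A minimal version of the frequency-separation idea: apply a learnable soft mask in the Fourier domain to obtain a low-frequency component, with the residual serving as the high-frequency part. This only sketches the split; the full ASFU and the surrounding pyramid network are more elaborate.

```python
# Hedged sketch: learnable Fourier-domain mask splitting features into
# low- and high-frequency parts, loosely in the spirit of the ASFU.
import torch
import torch.nn as nn

class FreqSplit(nn.Module):
    def __init__(self, h, w):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(h, w // 2 + 1))

    def forward(self, x):                        # x: (B, C, H, W)
        X = torch.fft.rfft2(x, norm="ortho")
        m = torch.sigmoid(self.mask_logits)      # learnable soft frequency mask
        low = torch.fft.irfft2(X * m, s=x.shape[-2:], norm="ortho")
        return low, x - low                      # low- and high-frequency parts

split = FreqSplit(64, 64)
low, high = split(torch.randn(2, 3, 64, 64))
```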
Authors: Pranav Pawar, Akshansh Dwivedi, Jenish Boricha, Himanshu Gohil, Aditya Dubey
While state-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual environments, synthesizing speech with correct multilingual accents (especially for Indic languages) and context-relevant emotions still poses difficulty owing to cultural nuance discrepancies in current frameworks. This paper introduces a new TTS architecture that integrates accent modeling and transliteration preservation with multi-scale emotion modelling, tuned in particular for Hindi and the Indian English accent. Our approach extends the Parler-TTS model by integrating a language-specific phoneme-alignment hybrid encoder-decoder architecture, culture-sensitive emotion embedding layers trained on native speaker corpora, and dynamic accent code-switching with residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (Word Error Rate reduced from 15.4% to 11.8%) and 85.3% emotion recognition accuracy from native listeners, surpassing the METTS and VECL-TTS baselines. The novelty of the system is that it can mix code in real time - generating statements such as "Namaste, let's talk about
Authors: Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu
In modern speech synthesis, paralinguistic information--such as a speaker's vocal timbre, emotional state, and dynamic prosody--plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserted speech prompts to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. It comprises three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, with English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess models' instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.
Authors: Gonçalo Granjal Cruz, Balazs Renczes, Mark C Runacres, Jan Decuyper
While accurate, black-box system identification models lack interpretability of the underlying system dynamics. This paper proposes State-Space Kolmogorov-Arnold Networks (SS-KAN) to address this challenge by integrating Kolmogorov-Arnold Networks within a state-space framework. The proposed model is validated on two benchmark systems: the Silverbox and the Wiener-Hammerstein benchmarks. Results show that SS-KAN provides enhanced interpretability, thanks to sparsity-promoting regularization and the direct visualization of its learned univariate functions, which reveal the system nonlinearities, at some cost in accuracy compared to state-of-the-art black-box models. These results highlight SS-KAN as a promising approach for interpretable nonlinear system identification that balances accuracy and interpretability of nonlinear system dynamics.
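[Editor's note] To make the "learned univariate functions" concrete, here is a minimal Kolmogorov-Arnold-style layer; the toy parameterization with a fixed Gaussian basis is our own assumption, and SS-KAN's actual splines and state-space wrapper differ.

    import torch

    class KANLayer(torch.nn.Module):
        """y_j = sum_i phi_ij(x_i), each phi_ij a learned mix of fixed bumps."""
        def __init__(self, d_in, d_out, n_basis=8):
            super().__init__()
            self.centers = torch.linspace(-2, 2, n_basis)   # fixed basis grid
            self.coef = torch.nn.Parameter(0.1 * torch.randn(d_in, d_out, n_basis))

        def forward(self, x):                               # x: (B, d_in)
            # Gaussian bumps at each input coordinate: shape (B, d_in, n_basis)
            B = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2) / 0.5)
            # phi_ij(x_i) = sum_k coef[i, j, k] * B[b, i, k], then sum over i
            return torch.einsum("bik,ijk->bj", B, self.coef)

    layer = KANLayer(3, 2)
    y = layer(torch.randn(5, 3))
    # After training, layer.coef[i, j] parameterizes a 1-D curve phi_ij that can
    # be plotted directly, which is where KAN-style interpretability comes from.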
Authors: Berk Yilmaz, Daniel Fidel Harvey, Prajit Dhuri
This study investigates the integration of signal processing transformations -- Fast Fourier Transform (FFT), Walsh-Hadamard Transform (WHT), and Discrete Cosine Transform (DCT) -- within the ResNet50 convolutional neural network (CNN) model for image classification. The primary objective is to assess the trade-offs between computational efficiency, energy consumption, and classification accuracy during training and inference. Using the CIFAR-100 dataset (100 classes, 60,000 images), experiments demonstrated that incorporating WHT significantly reduced energy consumption while improving accuracy. Specifically, a baseline ResNet50 model achieved a testing accuracy of 66%, consuming an average of 25,606 kJ per model. In contrast, a modified ResNet50 incorporating WHT in the early convolutional layers achieved 74% accuracy, and an enhanced version with WHT applied to both early and late layers achieved 79% accuracy, with an average energy consumption of only 39 kJ per model. These results demonstrate the potential of WHT as a highly efficient and effective approach for energy-constrained CNN applications.
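[Editor's note] For reference, the Walsh-Hadamard transform itself needs only O(n log n) additions and subtractions, which is the source of the reported energy savings. A minimal implementation (ours, not the paper's):

    import numpy as np

    def fwht(a):
        """Fast Walsh-Hadamard transform; len(a) must be a power of two."""
        a = np.asarray(a, dtype=float).copy()
        h = 1
        while h < len(a):
            for i in range(0, len(a), 2 * h):      # butterfly over blocks
                for j in range(i, i + h):
                    a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
            h *= 2
        return a

    x = np.array([1., 0., 1., 0., 0., 1., 1., 0.])
    print(fwht(x))            # unnormalized WHT coefficients
    print(fwht(fwht(x)) / 8)  # applying it twice and dividing by n recovers x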
Authors: Mohamad Hachem, Clément Roos, Thierry Miquel, Murat Bronz
This paper presents a robust cascaded control architecture for over-actuated multirotors. It extends the Incremental Nonlinear Dynamic Inversion (INDI) control combined with structured H_inf control, initially proposed for under-actuated multirotors, to a broader range of multirotor configurations. To achieve precise and robust attitude and position tracking, we employ a weighted least-squares geometric guidance control allocation method, formulated as a quadratic optimization problem, enabling full-pose tracking. The proposed approach effectively addresses key challenges, such as preventing infeasible pose references and enhancing robustness against disturbances, while also accounting for the multirotor's actual physical limitations. Numerical simulations with an over-actuated hexacopter validate the method's effectiveness, demonstrating its adaptability to diverse mission scenarios and its potential for real-world aerial applications.
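[Editor's note] The weighted least-squares allocation step admits a closed form when actuator limits are inactive. Below is a generic sketch; the effectiveness matrix B, weights W, and damping are illustrative assumptions, not the paper's values.

    import numpy as np

    def wls_allocate(B, w_cmd, W, lam=1e-3):
        """Damped weighted least squares: min ||B u - w_cmd||_W^2 + lam ||u||^2."""
        A = B.T @ W @ B + lam * np.eye(B.shape[1])   # ridge handles redundancy
        return np.linalg.solve(A, B.T @ W @ w_cmd)

    rng = np.random.default_rng(1)
    B = rng.normal(size=(6, 8))          # 6-DOF wrench, 8 effective inputs (over-actuated)
    W = np.diag([1., 1., 1., 10., 10., 10.])  # weight rotational errors more heavily
    u = wls_allocate(B, np.array([0., 0., 9.81, 0., 0., 0.]), W)
    # u distributes the commanded wrench across the redundant actuators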
Authors: Vlad Cnejevici, Matthias Ponfick, Raul C. Sîmpetru, Alessandro Del Vecchio
Restoring movement of a paralyzed foot is a key challenge in helping individuals with neurological conditions such as spinal cord injury (SCI) to improve their quality of life. Neuroprostheses based on functional electrical stimulation (FES) can restore the physiological range of motion by stimulating the affected muscles using surface electrodes. We have previously shown that, despite chronic motor-complete SCI, it is possible to capture paralyzed hand movements in individuals with tetraplegia using spared and modulated motor unit (MU) activity decoded with non-invasive electromyography (EMG) sensors. This study investigated whether a wearable high-density surface EMG system could capture and control paralyzed foot kinematics in closed-loop control with an FES system. We found that all our participants with SCI (2 with chronic SCI and 3 with acute SCI) retained distinct spared EMG activity for at least three ankle movements, which allowed them to reliably control a digital cursor using their spared tibialis anterior and triceps surae MU activity. Movement separability was further reconfirmed by extracting task-modulated MU activity during foot flexion/extension (3-7 modulated MUs/participant). Three participants were further able to modulate and maintain their foot flexion/extension EMG levels with an accuracy of >70%. Lastly, we show that real-time control of an FES system using EMG from the affected limb can restore foot movements in a highly intuitive way, significantly improving the lost or pathological foot range of motion. Our system provides an intuitive approach for closed-loop control of FES that has the potential to assist individuals with SCI in regaining lost motor functions.
Authors: Amir Reza Vazifeh, Jason W. Fleischer
Electrocardiograms (ECGs) provide direct, non-invasive measurements of heart activity and are well-established tools for detecting and monitoring cardiovascular disease. However, manual ECG analysis can be time-consuming and prone to errors. Machine learning has emerged as a promising approach for automated heartbeat recognition and classification, but substantial variations in ECG signals make it challenging to develop generalizable models. ECG signals can vary widely across individuals and leads, while datasets often follow different labeling standards and may be biased, all of which greatly hinder supervised methods. Conventional unsupervised methods, e.g. principal component analysis, prioritize large (and often obvious) variances in the data and typically overlook subtle yet clinically relevant patterns. If labels are missing and/or variations are significant but small, both approaches fail. Here, we show that nonlinear dimensionality reduction (NLDR) can accommodate these issues and identify medically relevant features in ECG signals, with no need for training or prior information. Using the MLII and V1 leads of the MIT-BIH dataset, we demonstrate that t-distributed stochastic neighbor embedding and uniform manifold approximation and projection can discriminate individual recordings in mixed populations with >= 90% accuracy and distinguish different arrhythmias in individual patients with a median accuracy of 98.96% and a median F1-score of 91.02%. The results show that NLDR holds much promise for cardiac monitoring, including the limiting cases of single-lead ECG and the current 12-lead standard of care, and for personalized health care beyond cardiology.
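[Editor's note] As a flavor of the approach, NLDR needs no labels at fit time. A minimal scikit-learn sketch follows; the synthetic beats stand in for the MIT-BIH recordings, which are not loaded here.

    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 180)
    # two synthetic "beat" morphologies standing in for different heartbeat classes
    normal = np.exp(-((t - 0.5) / 0.03) ** 2)
    ectopic = np.exp(-((t - 0.45) / 0.08) ** 2) - 0.3 * np.exp(-((t - 0.6) / 0.05) ** 2)
    beats = np.vstack([normal + 0.05 * rng.normal(size=(200, 180)),
                       ectopic + 0.05 * rng.normal(size=(200, 180))])

    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(beats)
    # The two morphologies separate into distinct clusters in `emb`, which can
    # then be labeled post hoc, with no training or prior information required.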
Authors: Tyler Landle, Jordan Rapp, Dean Blank, Chandramouli Amarnath, Abhijit Chatterjee, Alex Daglis, Umakishore Ramachandran
As autonomous vehicles edge closer to widespread adoption, enhancing road safety through collision avoidance and minimization of collateral damage becomes imperative. Vehicle-to-everything (V2X) technologies, which include vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), and vehicle-to-cloud (V2C), are being proposed as mechanisms to achieve this safety improvement. Simulation-based testing is crucial for early-stage evaluation of Connected Autonomous Vehicle (CAV) control systems, offering a safer and more cost-effective alternative to real-world tests. However, simulating large 3D environments with many complex single- and multi-vehicle sensors and controllers is computationally intensive. There is currently no evaluation framework that can effectively evaluate realistic scenarios involving large numbers of autonomous vehicles. We propose eCAV -- an efficient, modular, and scalable evaluation platform that facilitates both functional validation of algorithmic approaches to increasing road safety and performance prediction of algorithms across various V2X technologies, including a futuristic Vehicle-to-Edge control plane and correspondingly designed control algorithms. eCAV can model up to 256 vehicles running individual control algorithms without perception enabled, which is $8\times$ more vehicles than what is possible with state-of-the-art alternatives. With perception enabled, eCAV simulates up to 64 vehicles with a step time under 800ms, which is $4\times$ more and $1.5\times$ faster than the state-of-the-art OpenCDA framework.
Authors: Sreeja Roy-Singh, Alan P. Li, Vinay Ravindra, Roderick Lammers, Marc Sanchez Net
Fully re-orientable small spacecraft are now supported by commercial technologies, allowing them to point their instruments in any direction and capture images on short notice. When combined with improved onboard processing and implemented on a constellation of inter-communicable satellites, this intelligent agility can significantly increase responsiveness to transient or evolving phenomena. We demonstrate a ground-based and onboard algorithmic framework that combines orbital mechanics, attitude control, inter-satellite communication, intelligent prediction, and planning to schedule the time-varying re-orientation of agile small satellites in a constellation. Planner intelligence is improved by updating the predictive value of future space-time observations based on shared observations of evolving episodic precipitation and urban flood forecasts. Reliable inter-satellite communication within a fast, dynamic constellation topology is modeled in the physical, access control, and network layers. We apply the framework to a representative 24-satellite constellation observing 5 global regions. Results show appropriately low latency in information exchange (on average within one-third of the available time for implicit consensus), enabling the onboard scheduler to observe ~7% more flood magnitude than a ground-based implementation. Both onboard and offline versions performed ~98% better than constellations without agility.
Authors: Yunkee Chae, Kyogu Lee
Residual Vector Quantization (RVQ) has become a dominant approach in neural speech and audio coding, providing high-fidelity compression. However, speech coding presents additional challenges due to real-world noise, which degrades compression efficiency. Standard codecs allocate bits uniformly, wasting bitrate on noise components that do not contribute to intelligibility. This paper introduces a Variable Bitrate RVQ (VRVQ) framework for noise-robust speech coding, dynamically adjusting bitrate per frame to optimize rate-distortion trade-offs. Unlike constant bitrate (CBR) RVQ, our method prioritizes critical speech components while suppressing residual noise. Additionally, we integrate a feature denoiser to further improve noise robustness. Experimental results show that VRVQ improves rate-distortion trade-offs over conventional methods, achieving better compression efficiency and perceptual quality in noisy conditions. Samples are available at our project page: this https URL.
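[Editor's note] The RVQ mechanics that VRVQ builds on are simple to state: each stage quantizes the residual left by the previous stages, so truncating stages per frame trades bitrate for distortion. A generic sketch follows; the codebooks are random stand-ins and the paper's importance-map logic is not reproduced.

    import numpy as np

    def rvq_encode(x, codebooks, n_stages):
        """Quantize x with the first n_stages codebooks; return indices and reconstruction."""
        residual, recon, idxs = x.copy(), np.zeros_like(x), []
        for cb in codebooks[:n_stages]:
            d = np.linalg.norm(residual[None, :] - cb, axis=1)  # distance to codewords
            k = int(np.argmin(d))
            idxs.append(k)
            recon += cb[k]
            residual -= cb[k]
        return idxs, recon

    rng = np.random.default_rng(0)
    codebooks = [0.3 * rng.normal(size=(256, 64)) for _ in range(8)]  # 8 stages
    frame = rng.normal(size=64)
    for n in (2, 4, 8):                 # a variable-bitrate scheme picks n per frame
        _, recon = rvq_encode(frame, codebooks, n)
        print(n, np.linalg.norm(frame - recon))
    # Distortion shrinks as stages are added (only modestly here, since the
    # codebooks are untrained); VRVQ spends fewer stages on noise-dominated frames.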
Authors: Liyang Yu, Tianyi Wang, Junfeng Jiao, Fengwu Shan, Hongqing Chu, Bingzhao Gao
In complex real-world traffic environments, autonomous vehicles (AVs) need to interact with other traffic participants while making real-time and safety-critical decisions accordingly. The unpredictability of human behaviors poses significant challenges, particularly in dynamic scenarios, such as multi-lane highways and unsignalized T-intersections. To address these challenges, we design a bi-level interaction decision-making algorithm (BIDA) that integrates interactive Monte Carlo tree search (MCTS) with deep reinforcement learning (DRL), aiming to enhance the interaction rationality, efficiency, and safety of AVs in dynamic key traffic scenarios. Specifically, we adopt three types of DRL algorithms to construct a reliable value network and policy network, which guide the online deduction process of interactive MCTS by assisting in value updates and node selection. Then, a dynamic trajectory planner and a trajectory tracking controller are designed and implemented in CARLA to ensure smooth execution of planned maneuvers. Experimental evaluations demonstrate that our BIDA not only enhances interactive deduction and reduces computational costs, but also outperforms other recent benchmarks, exhibiting superior safety, efficiency, and interaction rationality under varying traffic conditions.
Authors: Melih Özcan, Ozgur S. Oguz
Robotic manipulation demands precise control over both contact forces and motion trajectories. While force control is essential for achieving compliant interaction and high-frequency adaptation, it is limited to operations in close proximity to the manipulated object and often fails to maintain stable orientation during extended motion sequences. Conversely, optimization-based motion planning excels in generating collision-free trajectories over the robot's configuration space but struggles with dynamic interactions where contact forces play a crucial role. To address these limitations, we propose a multi-modal control framework that combines force control and optimization-augmented motion planning to tackle complex robotic manipulation tasks in a sequential manner, enabling seamless switching between control modes based on task requirements. Our approach decomposes complex tasks into subtasks, each dynamically assigned to one of three control modes: Pure optimization for global motion planning, pure force control for precise interaction, or hybrid control for tasks requiring simultaneous trajectory tracking and force regulation. This framework is particularly advantageous for bimanual and multi-arm manipulation, where synchronous motion and coordination among arms are essential while considering both the manipulated object and environmental constraints. We demonstrate the versatility of our method through a range of long-horizon manipulation tasks, including single-arm, bimanual, and multi-arm applications, highlighting its ability to handle both free-space motion and contact-rich manipulation with robustness and precision.
Authors: Dana Serditova, Kevin Tang, Jochen Steffens
Automatic Speech Recognition (ASR) systems struggle with regional dialects due to biased training which favours mainstream varieties. While previous research has identified racial, age, and gender biases in ASR, regional bias remains underexamined. This study investigates ASR performance on Newcastle English, a well-documented regional dialect known to be challenging for ASR. A two-stage analysis was conducted: first, a manual error analysis on a subsample identified key phonological, lexical, and morphosyntactic errors behind ASR misrecognitions; second, a case study focused on the systematic analysis of ASR recognition of the regional pronouns ``yous'' and ``wor''. Results show that ASR errors directly correlate with regional dialectal features, while social factors play a lesser role in ASR mismatches. We advocate for greater dialectal diversity in ASR training data and highlight the value of sociolinguistic analysis in diagnosing and addressing regional biases.
Authors: Mattia Bianchi, Florian Dörfler
Online Feedback Optimization (OFO) steers a dynamical plant to a cost-efficient steady state, relying only on input-output sensitivity information rather than on a full plant model. Unlike traditional feedforward approaches, OFO leverages real-time measurements from the plant, thereby inheriting the robustness and adaptability of feedback control. Unfortunately, existing theoretical guarantees for OFO assume that the controller operates on a slower timescale than the plant, which can affect responsiveness and transient performance. In this paper, we focus on relaxing this ``timescale separation'' assumption. Specifically, we consider the class of monotone systems, and we prove that OFO can achieve an optimal operating point regardless of the time constants of controller and plant. By leveraging a small-gain theorem for monotone systems, we derive several sufficient conditions for global convergence. Notably, these conditions depend only on the steady-state behavior of the plant and are entirely independent of the transient dynamics.
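[Editor's note] In its simplest form, OFO is a gradient controller closed around measured outputs. The toy sketch below uses a static map in place of the plant dynamics; the plant, cost, and step size are illustrative assumptions only.

    import numpy as np

    # toy steady-state plant map: y = P u + d, with d an unknown disturbance
    P = np.array([[1.0, 0.5], [0.2, 1.0]])
    d = np.array([0.3, -0.1])

    def cost_grads(u, y):
        # objective: ||u||^2 + ||y - y_ref||^2
        y_ref = np.array([1.0, 0.0])
        return 2 * u, 2 * (y - y_ref)

    u = np.zeros(2)
    for _ in range(200):
        y = P @ u + d                      # measurement from the (here simulated) plant
        gu, gy = cost_grads(u, y)
        u = u - 0.05 * (gu + P.T @ gy)     # feedback step using only the sensitivity P
    # u converges to the optimizer of the steady-state problem without ever knowing d,
    # because d enters through the measured y rather than through a model.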
Authors: Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel
Modern neural-network-based speech recognition models are required to continually absorb new data without re-training the whole system, especially in downstream applications using foundation models, which have no access to the original training data. Continually training the models in a rehearsal-free, multilingual, and language-agnostic condition likely leads to catastrophic forgetting, where a seemingly insignificant disruption to the weights can destructively harm the quality of the models. Inspired by the ability of human brains to learn and consolidate knowledge through the waking-sleeping cycle, we propose a continual learning approach with two distinct phases: factorization and centralization, learning and merging knowledge accordingly. Our experiments on a sequence of varied code-switching datasets show that the centralization stage can effectively prevent catastrophic forgetting by accumulating the knowledge in multiple scattered low-rank adapters.
Authors: Tuan-Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel
We propose the first streaming accent conversion (AC) model that transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves performance comparable to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
Authors: Manno Versluis, Yizhuo Wu, Chang Gao
Digital predistortion (DPD) is crucial for linearizing radio frequency (RF) power amplifiers (PAs), improving signal integrity and efficiency in wireless systems. Neural network (NN)-based DPD methods surpass traditional polynomial models but face computational challenges limiting their practical deployment. This paper introduces SparseDPD, an FPGA accelerator employing a spatially sparse phase-normalized time-delay neural network (PNTDNN), optimized through unstructured pruning to reduce computational load without accuracy loss. Implemented on a Xilinx Zynq-7Z010 FPGA, SparseDPD operates at 170 MHz, achieving exceptional linearization performance (ACPR: -59.4 dBc, EVM: -54.0 dBc, NMSE: -48.2 dB) with only 241 mW dynamic power, using 64 parameters with 74% sparsity. This work demonstrates FPGA-based acceleration, making NN-based DPD practical and efficient for real-time wireless communication applications. Code is publicly available at this https URL.
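[Editor's note] Unstructured magnitude pruning of the kind used here is available off the shelf in PyTorch. The sketch below applies 74% L1 pruning to a tiny stand-in network; the layer sizes are illustrative, not SparseDPD's.

    import torch
    from torch.nn.utils import prune

    # a tiny MLP standing in for the phase-normalized time-delay network
    net = torch.nn.Sequential(
        torch.nn.Linear(16, 8), torch.nn.Tanh(), torch.nn.Linear(8, 2)
    )
    for module in net:
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.74)  # 74% of weights -> 0
            prune.remove(module, "weight")   # bake the mask into the weight tensor

    total = sum(p.numel() for p in net.parameters())
    zeros = sum((p == 0).sum().item() for p in net.parameters())
    print(f"sparsity: {zeros / total:.2%}")  # biases stay dense, so slightly under 74%
    # On an FPGA, the zeroed multiplications can simply be omitted from the datapath.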
Authors: Nicolas Samson, William Larrivée-Hardy, William Dubois, Élie Roy-Brouard, Edith Brotherton, Dominic Baril, Julien Lépine, François Pomerleau
Off-road autonomous navigation is a challenging task as it is mainly dependent on the accuracy of the motion model. Motion model performance is limited by its ability to predict the interaction between the terrain and the UGV, which an onboard sensor cannot directly measure. In this work, we propose using the DRIVE protocol to standardize the collection of data for system identification and characterization of the slip state space. We validated this protocol by acquiring a dataset with two platforms (from 75 kg to 470 kg) on six terrains (i.e., asphalt, grass, gravel, ice, mud, sand) for a total of 4.9 hours and 14.7 km. Using this data, we evaluate the DRIVE protocol's ability to explore the velocity command space and identify the reachable velocities for terrain-robot interactions. We investigated the transfer function between the command velocity space and the resulting steady-state slip for an SSMR. An unpredictability metric is proposed to estimate command uncertainty and help assess risk likelihood and severity in deployment. Finally, we share our lessons learned on running system identification on large UGVs to help the community.
Authors: Muhammad Azeem Aslam, Muhammad Hamza, Nisar Ahmed, Gulshan Saleem, Zhu Shuangtong, Hu Hongfei, Xu Wei, Saba Aslam, Wang Jun
Image Quality Assessment (IQA) is a critical task in a wide range of applications but remains challenging due to the subjective nature of human perception and the complexity of real-world image distortions. This study proposes MetaQAP, a novel no-reference IQA model designed to address these challenges by leveraging quality-aware pre-training and meta-learning. The model makes three key contributions: pre-training Convolutional Neural Networks (CNNs) on a quality-aware dataset, implementing a quality-aware loss function to optimize predictions, and integrating a meta-learner to form an ensemble model that effectively combines predictions from multiple base models. Experimental evaluations were conducted on three benchmark datasets: LiveCD, KonIQ-10K, and BIQ2021. The proposed MetaQAP model achieved exceptional performance with Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Order Correlation Coefficient (SROCC) scores of 0.9885/0.9812 on LiveCD, 0.9702/0.9658 on KonIQ-10K, and 0.884/0.8765 on BIQ2021, outperforming existing IQA methods. Cross-dataset evaluations further demonstrated the generalizability of the model, with PLCC and SROCC scores ranging from 0.6721 to 0.8023 and 0.6515 to 0.7805, respectively, across diverse datasets. The ablation study confirmed the significance of each model component, revealing substantial performance degradation when critical elements such as the meta-learner or quality-aware loss function were omitted. MetaQAP not only addresses the complexities of authentic distortions but also establishes a robust and generalizable framework for practical IQA applications. By advancing the state-of-the-art in no-reference IQA, this research provides valuable insights and methodologies for future improvements and extensions in the field.
Authors: Shoichi Koyama, Kenji Ishizuka
A learning-based method for estimating the magnitude distribution of sound fields from spatially sparse measurements is proposed. Estimating the magnitude distribution of an acoustic transfer function (ATF) is useful when phase measurements are unreliable or inaccessible, and it has a wide range of applications related to spatial audio. We propose a neural-network-based method for ATF magnitude estimation. The key features of our network architecture are input and output layers conditioned on the source position, receiver position, and frequency, together with an aggregation module of latent variables; the architecture can be interpreted as an autoencoder-based extension of the basis expansion of the sound field. Numerical simulation results indicated that the ATF magnitude is accurately estimated with a small number of receivers by our proposed method.
Authors: Yunshan Li, Wenwu Gong, Qianqian Wang, Chao Wang, Lili Yang
Recent approaches based on transform-based tensor nuclear norm (TNN) have demonstrated notable effectiveness in hyperspectral image (HSI) inpainting by leveraging low-rank structures in latent representations. Recent developments incorporate deep transforms to improve low-rank tensor representation; however, existing approaches typically restrict the transform to the spectral mode, neglecting low-rank properties along other tensor modes. In this paper, we propose a novel 3-directional deep low-rank tensor representation (3DeepRep) model, which performs deep nonlinear transforms along all three modes of the HSI tensor. To enforce low-rankness, the model minimizes the nuclear norms of mode-i frontal slices in the corresponding latent space for each direction (i=1,2,3), forming a 3-directional TNN regularization. The outputs from the three directional branches are subsequently fused via a learnable aggregation module to produce the final result. An efficient gradient-based optimization algorithm is developed to solve the model in a self-supervised manner. Extensive experiments on real-world HSI datasets demonstrate that the proposed method achieves superior inpainting performance compared to existing state-of-the-art techniques, both qualitatively and quantitatively.
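[Editor's note] The building block behind this style of regularizer is the mode-i nuclear norm. The sketch below shows the classical sum-of-unfoldings surrogate in plain numpy; it is related to, but not identical to, the paper's slice-wise TNN in learned latent spaces, whose deep transforms are omitted here.

    import numpy as np

    def unfold(T, mode):
        """Mode-i unfolding: move axis `mode` to the front and flatten the rest."""
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def nuclear_norm(M):
        return np.linalg.svd(M, compute_uv=False).sum()

    T = np.random.default_rng(0).normal(size=(32, 32, 16))   # toy H x W x spectral cube
    tnn_3d = sum(nuclear_norm(unfold(T, i)) for i in range(3))
    # A 3DeepRep-style model would apply a learned nonlinear transform before each
    # directional low-rank term and fuse the three branches with a learnable module.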
Authors: Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim
With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25Hz, 12.5Hz, and 6.25Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performances on speech-to-text and consistently outperform baselines on text-to-speech tasks.
Authors: Eion Tyacke, Kunal Gupta, Jay Patel, Rui Li
Hand gestures are a primary output of the human motor system, yet the decoding of their neuromuscular signatures remains a bottleneck for basic neuroscience and assistive technologies such as prosthetics. Traditional human-machine interface pipelines rely on a single biosignal modality, but multimodal fusion can exploit complementary information from sensors. We systematically compare linear and attention-based fusion strategies across three architectures: a Multimodal MLP, a Multimodal Transformer, and a Hierarchical Transformer, evaluating performance on scenarios with unimodal and multimodal inputs. Experiments use two publicly available datasets: NinaPro DB2 (sEMG and accelerometer) and HD-sEMG 65-Gesture (high-density sEMG and force). Across both datasets, the Hierarchical Transformer with attention-based fusion consistently achieved the highest accuracy, surpassing the multimodal and best single-modality linear-fusion MLP baseline by over 10% on NinaPro DB2 and 3.7% on HD-sEMG. To investigate how modalities interact, we introduce an Isolation Network that selectively silences unimodal or cross-modal attention pathways, quantifying each group of token interactions' contribution to downstream decisions. Ablations reveal that cross-modal interactions contribute approximately 30% of the decision signal across transformer layers, highlighting the importance of attention-driven fusion in harnessing complementary modality information. Together, these findings reveal when and how multimodal fusion would enhance biosignal classification and also provides mechanistic insights of human muscle activities. The study would be beneficial in the design of sensor arrays for neurorobotic systems.
Authors: Jianyuan Feng, Guangzheng Li, Yangfei Xu
Language-queried Audio Separation (LASS) employs linguistic queries to isolate target sounds based on semantic descriptions. However, existing methods face challenges in aligning complex auditory features with linguistic context while preserving separation precision. Current research efforts focus primarily on text description augmentation and architectural innovations, yet the potential of integrating pre-trained self-supervised learning (SSL) audio models and Contrastive Language-Audio Pretraining (CLAP) frameworks, capable of extracting cross-modal audio-text relationships, remains underexplored. To address this, we present HybridSep, a two-stage LASS framework that synergizes SSL-based acoustic representations with CLAP-derived semantic embeddings. Our framework introduces Adversarial Consistent Training (ACT), a novel optimization strategy that treats diffusion as an auxiliary regularization loss while integrating adversarial training to enhance separation fidelity. Experiments demonstrate that HybridSep achieves significant performance improvements over state-of-the-art baselines (e.g., AudioSep, FlowSep) across multiple metrics, establishing new benchmarks for LASS tasks.
Authors: Junghyun Koo, Marco A. Martinez-Ramirez, Wei-Hsiang Liao, Giorgio Fabbro, Michele Mancusi, Yuki Mitsufuji
Music mastering style transfer aims to model and apply the mastering characteristics of a reference track to a target track, simulating the professional mastering process. However, existing methods apply fixed processing based on a reference track, limiting users' ability to fine-tune the results to match their artistic intent. In this paper, we introduce the ITO-Master framework, a reference-based mastering style transfer system that integrates Inference-Time Optimization (ITO) to enable finer user control over the mastering process. By optimizing the reference embedding during inference, our approach allows users to refine the output dynamically, making micro-level adjustments to achieve more precise mastering results. We explore both black-box and white-box methods for modeling mastering processors and demonstrate that ITO improves mastering performance across different styles. Through objective evaluation, subjective listening tests, and qualitative analysis using text-based conditioning with CLAP embeddings, we validate that ITO enhances mastering style similarity while offering increased adaptability. Our framework provides an effective and user-controllable solution for mastering style transfer, allowing users to refine their results beyond the initial style transfer.
Authors: Partha Chowdhury, Harsha M, Ayush Gupta, Sanat K Biswas
This work presents an indigenous web-based platform, Orbital Collision (OrCo), created by the Space Systems Laboratory at IIIT Delhi, to enhance Space Situational Awareness (SSA) by predicting collision probabilities of space objects using Two-Line Element (TLE) data. The work highlights the growing challenges of congestion in the Earth's orbital environment, mainly due to space debris and defunct satellites, which increase collision risks. It employs several methods for propagating orbital uncertainty and calculating the collision probability. The performance of the platform is evaluated through accuracy assessments and efficiency metrics, in order to improve the tracking of space objects and ensure the safety of satellites in congested space.
Authors: Haina Qin, Wenyang Luo, Libin Wang, Dandan Zheng, Jingdong Chen, Ming Yang, Bing Li, Weiming Hu
Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restoration framework that models the degradation process as a deterministic path using continuous normalizing flows. ResFlow augments the degradation process with an auxiliary process that disambiguates the uncertainty in HQ prediction to enable reversible modeling of the degradation process. ResFlow adopts entropy-preserving flow paths and learns the augmented degradation flow by matching the velocity field. ResFlow significantly improves the performance and speed of image restoration, completing the task in fewer than four sampling steps. Extensive experiments demonstrate that ResFlow achieves state-of-the-art results across various image restoration benchmarks, offering a practical and efficient solution for real-world applications.
Authors: Yiyang Tie, Hong Zhu, Yunyun Luo, Jing Shi
The training of real-world super-resolution reconstruction models heavily relies on datasets that reflect real-world degradation patterns. Extracting and modeling degradation patterns for super-resolution reconstruction using only real-world low-resolution (LR) images remains a challenging task. When synthesizing datasets to simulate real-world degradation, relying solely on degradation extraction methods fails to capture both blur and diverse noise characteristics across varying LR distributions, as well as more implicit degradations such as color gamut shifts. Conversely, domain translation alone cannot accurately approximate real-world blur characteristics due to the significant degradation domain gap between synthetic and real data. To address these challenges, we propose a novel TripleGAN framework comprising two strategically designed components: The FirstGAN primarily focuses on narrowing the domain gap in blur characteristics, while the SecondGAN performs domain-specific translation to approximate target-domain blur properties and learn additional degradation patterns. The ThirdGAN is trained on pseudo-real data generated by the FirstGAN and SecondGAN to reconstruct real-world LR images. Extensive experiments on the RealSR and DRealSR datasets demonstrate that our method exhibits clear advantages in quantitative metrics while maintaining sharp reconstructions without over-smoothing artifacts. The proposed framework effectively learns real-world degradation patterns from LR observations and synthesizes aligned datasets with corresponding degradation characteristics, thereby enabling the trained network to achieve superior performance in reconstructing high-quality SR images from real-world LR inputs.
Authors: Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
Foundation models have revolutionized music information retrieval, but questions remain about their ability to generalize across diverse musical traditions. This paper presents a comprehensive evaluation of five state-of-the-art audio foundation models across six musical corpora spanning Western popular, Greek, Turkish, and Indian classical traditions. We employ three complementary methodologies to investigate these models' cross-cultural capabilities: probing to assess inherent representations, targeted supervised fine-tuning of 1-2 layers, and multi-label few-shot learning for low-resource scenarios. Our analysis shows varying cross-cultural generalization, with larger models typically outperforming on non-Western music, though results decline for culturally distant traditions. Notably, our approaches achieve state-of-the-art performance on five out of six evaluated datasets, demonstrating the effectiveness of foundation models for world music understanding. We also find that our targeted fine-tuning approach does not consistently outperform probing across all settings, suggesting foundation models already encode substantial musical knowledge. Our evaluation framework and benchmarking results contribute to understanding how far current models are from achieving universal music representations while establishing metrics for future progress.
Authors: Ezzat Elokda, Andrea Censi, Emilio Frazzoli, Florian Dörfler, Saverio Bolognani
Control systems will play a pivotal role in addressing societal-scale challenges as they drive the development of sustainable future smart cities. At the heart of these challenges is the trustworthy, fair, and efficient allocation of scarce public resources, including renewable energy, transportation, data, computation, etc. Historical evidence suggests that monetary control -- the prototypical mechanism for managing resource scarcity -- is not always well-accepted in socio-technical resource contexts. In this vision article, we advocate for karma economies as an emerging non-monetary mechanism for socio-technical control. Karma leverages the repetitive nature of many socio-technical resources to jointly attain trustworthy, fair, and efficient allocations, by budgeting resource consumption over time and letting resource users ``play against their future selves.'' To motivate karma, we review related concepts in economics through a control systems lens, and make a case for a) shifting the viewpoint of resource allocations from single-shot and static to repeated and dynamic games; and b) adopting long-run Nash welfare as the formalization of ``fairness and efficiency'' in socio-technical contexts. We show that in many dynamic resource settings, karma Nash equilibria maximize long-run Nash welfare. Moreover, we discuss implications for a future smart city built on multi-karma economies: by choosing whether to combine different socio-technical resources, e.g., electricity and transportation, in a single karma economy, or to separate them into resource-specific economies, karma provides new flexibility to design the scope of fairness and efficiency.
Authors: Sibo Zhang, Bruno Clerckx, David Vargas
Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique. We propose a novel architecture for downlink RSMA, namely Codeword-Segmentation RSMA (CS-RSMA). Different from conventional RSMA, which splits users' messages into common and private parts before encoding, CS-RSMA encodes the users' messages directly, segments the codewords into common and private parts, and transmits the codeword segments using common and private streams. In addition to the principle of CS-RSMA, a novel performance analysis framework is proposed. This framework utilizes a recent discovery in mismatched decoding under finite-alphabet input and interference, and can better capture the receiver's complexity limits. Precoder optimization under finite alphabets and suboptimal decoders for conventional RSMA and CS-RSMA to maximize the Sum-Rate (SR) and the Max-Min Fairness (MMF) is also addressed. The numerical results reveal the theoretical performance of conventional RSMA and CS-RSMA. We observe that CS-RSMA leads to better performance than conventional RSMA in SR, and similar performance in MMF. Furthermore, a physical-layer implementation of CS-RSMA is proposed and evaluated through link-level simulations. Aside from performance benefits, we also demonstrate that CS-RSMA brings significant advantages to the encoding/decoding, control signaling, and retransmission processes compared to conventional RSMA.
Authors: Albert H. Li, Brandon Hung, Aaron D. Ames, Jiuguang Wang, Simon Le Cleac'h, Preston Culbertson
Recent advancements in parallel simulation and successful robotic applications are spurring a resurgence in sampling-based model predictive control. To build on this progress, however, the robotics community needs common tooling for prototyping, evaluating, and deploying sampling-based controllers. We introduce Judo, a software package designed to address this need. To facilitate rapid prototyping and evaluation, Judo provides robust implementations of common sampling-based MPC algorithms and standardized benchmark tasks. It further emphasizes usability with simple but extensible interfaces for controller and task definitions, asynchronous execution for straightforward simulation-to-hardware transfer, and a highly customizable interactive GUI for tuning controllers. While written in Python, the software leverages MuJoCo as its physics backend to achieve real-time performance, which we validate across both consumer and server-grade hardware. Code at this https URL.
Authors: Yike Xu, Mark S. Andersland
In this paper, we shed new light on a classical scheduling problem: given a slot-timed, constant-capacity server, what short-run scheduling decisions must be made to provide long-run service guarantees to competing flows of unit-sized tasks? We model each flow's long-run guarantee as a worst-case service that maps each queued arrival vector recording the flow's cumulative task arrivals, including those initially queued, to a worst-case acceptable departure vector lower-bounding its cumulative served tasks. We show that these maps are states that can be updated as tasks arrive and are served, introduce state-based scheduling, find the schedulability condition necessary and sufficient to maintain all flows' long-run guarantees, and use this condition to identify all short-run scheduling decisions that preserve schedulability. Our framework is general but computationally complex. To reduce complexity, we consider three specializations. First, we show that when satisfactory short-run scheduling decisions exist, at least one can be efficiently identified by maximizing the server's capacity slack, a generalization of the earliest-deadline-first rule. Second, we show that a special class of worst-case services, min-plus services, can be efficiently specified and updated using properties of the min-plus algebra. Finally, we show that efficiency can be further improved by restricting attention to a min-plus service subclass, dual-curve services. This last specialization turns out to be a dynamic extension of service curves that maintains all essential features of our framework while approaching near practical viability.
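[Editor's note] The min-plus machinery referenced above is compact in code. The sketch below shows the (min,+) convolution used to compose arrival and service curves on a slot grid; the curves chosen are purely illustrative, not the paper's construction.

    import numpy as np

    def minplus_conv(f, g):
        """(f (x) g)(t) = min over 0<=s<=t of f(s) + g(t-s), on a common slot grid."""
        T = len(f)
        return np.array([min(f[s] + g[t - s] for s in range(t + 1)) for t in range(T)])

    t = np.arange(10)
    arrivals = np.minimum(3 * t, t + 8)        # a concave cumulative-arrival envelope
    service = np.maximum(0, 2 * (t - 2))       # rate-latency curve: rate 2, latency 2 slots
    departures_lb = minplus_conv(arrivals, service)
    # departures_lb lower-bounds cumulative departures under the guaranteed service,
    # the kind of worst-case map the paper treats as a schedulable state.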
Authors: Suparno Bhattacharyya, Joseph. P. Cusumano
We study the reduced order modeling of a piecewise-linear, globally nonlinear flexible oscillator in which a Bernoulli-Euler beam is subjected to a position-triggered kick force and a piecewise restoring force at its tip. The nonsmooth boundary conditions, which determine different regions of a hybrid phase space, can generally be expected to excite many degrees of freedom. With kick strength as parameter, the system's bifurcation diagram is found to exhibit a range of periodic and chaotic behaviors. Proper orthogonal decomposition (POD) is used to obtain a single set of global basis functions spanning all of the hybrid regions. The reduced order model (ROM) dimension is chosen using previously developed energy closure analysis, ensuring approximate energy balance on the reduced subspace. This yields accurate ROMs with 8 degrees of freedom. Remarkably, we find that ROMs formulated using data from individual periodic steady states can nevertheless be used to reconstruct the entire bifurcation structure of the original system without updating. This demonstrates that, despite being constructed with steady-state data, the ROMs model sufficiently small transients with enough accuracy to permit using simple continuation for the bifurcation diagram. We also find that ROM subspaces obtained for different values of the bifurcation parameter are essentially identical. Thus, POD augmented with energy closure analysis is found to reliably yield effective dimension estimates and ROMs for this nonlinear, nonsmooth system that are robust across stability transitions, including even period-doubling cascades to chaos, thereby greatly reducing data requirements and computational costs.
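[Editor's note] POD itself reduces to an SVD of the snapshot matrix. Below is a minimal sketch with an energy-based truncation criterion, in the spirit of, but not identical to, the paper's energy closure analysis.

    import numpy as np

    def pod_basis(snapshots, energy=0.999):
        """snapshots: (n_dof, n_samples). Returns the leading POD modes and their count."""
        X = snapshots - snapshots.mean(axis=1, keepdims=True)
        U, s, _ = np.linalg.svd(X, full_matrices=False)
        frac = np.cumsum(s**2) / np.sum(s**2)     # captured "energy" vs. mode count
        r = int(np.searchsorted(frac, energy)) + 1
        return U[:, :r], r

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 1000))  # rank-6 toy trajectory data
    Phi, r = pod_basis(X)       # r == 6 here
    q = Phi.T @ X               # reduced coordinates; the ROM ODEs are written in q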
Authors: Manish Prajapat, Johannes Köhler, Matteo Turchetta, Andreas Krause, Melanie N. Zeilinger
Safely exploring environments with a-priori unknown constraints is a fundamental challenge that restricts the autonomy of robots. While safety is paramount, guarantees on sufficient exploration are also crucial for ensuring autonomous task completion. To address these challenges, we propose a novel safe guaranteed exploration framework using optimal control, which achieves first-of-its-kind results: guaranteed exploration for non-linear systems with finite time sample complexity bounds, while being provably safe with arbitrarily high probability. The framework is general and applicable to many real-world scenarios with complex non-linear dynamics and unknown domains. We improve the efficiency of this general framework by proposing an algorithm, SageMPC, SAfe Guaranteed Exploration using Model Predictive Control. SageMPC leverages three key techniques: i) exploiting a Lipschitz bound, ii) goal-directed exploration, and iii) receding horizon style re-planning, all while maintaining the desired sample complexity, safety and exploration guarantees of the framework. Lastly, we demonstrate safe efficient exploration in challenging unknown environments using SageMPC with a car model.
Authors: Shaimaa K. El-Baklish, Anastasios Kouvelas, Michail A. Makridis
Automated vehicle technologies offer a promising avenue for enhancing traffic efficiency, safety, and energy consumption. Among these, Adaptive Cruise Control (ACC) systems stand out as a prevalent form of automation on today's roads, with their time gap settings holding paramount importance. While decreasing the average time headway tends to enhance traffic capacity, it simultaneously raises concerns regarding safety and string stability. This study introduces a novel variable time gap feedback control policy aimed at striking a balance between maintaining a minimum time gap setting under equilibrium car-following conditions, thereby improving traffic capacity, while ensuring string stability to mitigate disturbances away from the equilibrium flow. Leveraging nonlinear $H_\infty$ control technique, the strategy employs a variable time gap component as the manipulated control signal, complemented by a constant time gap component that predominates during car-following equilibrium. The effectiveness of the proposed scheme is evaluated against its constant time-gap counterpart calibrated using field platoon data from the OpenACC dataset. Through numerical and traffic simulations, our findings illustrate that the proposed algorithm effectively dampens perturbations within vehicle platoons, leading to a more efficient and safer mixed traffic flow.
Authors: Runmin Jiang, Zhaoxin Fan, Junhao Wu, Lenghan Zhu, Xin Huang, Tianyang Wang, Heng Huang, Min Xu
3D medical image segmentation is a challenging task with crucial implications for disease diagnosis and treatment planning. Recent advances in deep learning have significantly enhanced fully supervised medical image segmentation. However, this approach heavily relies on labor-intensive and time-consuming fully annotated ground-truth labels, particularly for 3D volumes. To overcome this limitation, we propose a novel probabilistic-aware weakly supervised learning pipeline, specifically designed for 3D medical imaging. Our pipeline integrates three innovative components: a Probability-based Pseudo Label Generation technique for synthesizing dense segmentation masks from sparse annotations, a Probabilistic Multi-head Self-Attention network for robust feature extraction within our Probabilistic Transformer Network, and a Probability-informed Segmentation Loss Function to enhance training with annotation confidence. Demonstrating significant advances, our approach not only rivals the performance of fully supervised methods but also surpasses existing weakly supervised methods in CT and MRI datasets, achieving up to 18.1% improvement in Dice scores for certain organs. The code is available at this https URL.
Authors: Ondřej Mokrý, Pavel Rajmic
A novel variant of the Janssen method for audio inpainting is presented and compared to other popular audio inpainting methods based on autoregressive (AR) modeling. Both conceptual differences and practical implications are discussed. The experiments demonstrate the importance of the choice of the AR model estimator, window/context length, and model order. The results show the superiority of the proposed gap-wise Janssen approach using objective metrics, which is confirmed by a listening test.
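[Editor's note] The AR backbone shared by these methods is easy to sketch: fit coefficients on the reliable context, then extrapolate into the gap. Below, a simple least-squares fit with forward prediction; the Janssen iteration itself alternates such estimation and interpolation steps and is not reproduced here.

    import numpy as np

    def fit_ar(x, p):
        """Least-squares AR(p) coefficients for signal x."""
        rows = np.array([x[i:i + p] for i in range(len(x) - p)])
        a, *_ = np.linalg.lstsq(rows, x[p:], rcond=None)
        return a                # x[n] ~ sum_k a[k] * x[n - p + k]

    def inpaint_forward(context, gap_len, p=32):
        a = fit_ar(context, p)
        out = list(context[-p:])
        for _ in range(gap_len):
            out.append(float(np.dot(a, out[-p:])))
        return np.array(out[p:])

    t = np.arange(2000) / 8000.0
    x = np.sin(2 * np.pi * 440 * t)                  # toy tonal signal
    filled = inpaint_forward(x[:1000], gap_len=100)  # predicts the 100 missing samples
    # Model order p and context length strongly affect quality, which is exactly
    # the sensitivity the experiments above quantify.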
Authors: Will Sharpless, Yat Tin Chow, Sylvia Herbert
Hamilton-Jacobi reachability (HJR) provides a value function that encodes the set of states from which a system with bounded control inputs can reach or avoid a target despite any bounded disturbance, and the corresponding robust, optimal control policy. Though powerful, traditional methods for HJR rely on dynamic programming (DP) and suffer from exponential computation growth with respect to state dimension. The recently favored Hopf formula mitigates this ``curse of dimensionality'' by providing an efficient and space-parallelizable approach for solving the reachability problem. However, the Hopf formula can only be applied to linear time-varying systems. To overcome this limitation, we show that the error between a nonlinear system and a linear model can be transformed into an adversarial bounded artificial disturbance. One may then solve the dimension-robust generalized Hopf formula for a linear game with this ``antagonistic error'' to perform guaranteed conservative reachability analysis and control synthesis of nonlinear systems; this can be done for problem formulations in which no other HJR method is both computationally feasible and guaranteed. In addition, we offer several technical methods for reducing conservativeness in the analysis. We demonstrate the effectiveness of our results through one illustrative example (the controlled Van der Pol system) that can be compared to standard DP, and one higher-dimensional 15D example (a 5-agent pursuit-evasion game with Dubins cars).
Authors: Will Sharpless, Yat Tin Chow, Sylvia Herbert
Hamilton-Jacobi Reachability (HJR) is a popular method for analyzing the liveness and safety of a dynamical system with bounded control and disturbance. The corresponding HJ value function offers a robust controller and characterizes the reachable sets, but is traditionally solved with Dynamic Programming (DP) and limited to systems of dimension less than six. Recently, the space-parallelizable, generalized Hopf formula has been shown to also solve the HJ value with a nearly three-log increase in dimension limit, but is limited to linear systems. To extend this potential, we demonstrate how state-augmented (SA) spaces, which are well-known for their improved linearization accuracy, may be used to solve tighter, conservative approximations of the value function with any linear model in this SA space. Namely, we show that with a representation of the true dynamics in the SA space, a series of inequalities confirms that the value of a SA linear game with antagonistic error is a conservative envelope of the true value function. It follows that if the optimal controller for the HJ SA linear game with error may succeed, it will also succeed in the true system. Unlike previous methods, this result offers the ability to safely approximate reachable sets and their corresponding controllers with the Hopf formula in a non-convex manner. Finally, we demonstrate this in the slow manifold system for clarity, and in the controlled Van der Pol system with different lifting functions.
Authors: Manasi Muglikar, Siddharth Somasundaram, Akshat Dave, Edoardo Charbon, Ramesh Raskar, Davide Scaramuzza
Traditional cameras face a trade-off between low-light performance and high-speed imaging: longer exposure times capture sufficient light but result in motion blur, whereas shorter exposures result in Poisson-corrupted noisy images. While burst photography techniques help mitigate this tradeoff, conventional cameras are fundamentally limited by their sensor noise characteristics. Event cameras and single-photon avalanche diode (SPAD) sensors have emerged as promising alternatives to conventional cameras due to their desirable properties. SPADs are capable of single-photon sensitivity with microsecond temporal resolution, and event cameras can measure brightness changes up to 1 MHz with low bandwidth requirements. We show that these properties are complementary and can help achieve low-light, high-speed image reconstruction with low bandwidth requirements. We introduce a sensor fusion framework that combines SPADs with event cameras to improve the reconstruction of high-speed, low-light scenes while reducing the high bandwidth cost associated with using every SPAD frame. Our evaluation, on both synthetic and real sensor data, demonstrates significant enhancements (> 5 dB PSNR) in reconstructing low-light scenes at high temporal resolution (100 kHz) compared to conventional cameras. Event-SPAD fusion shows great promise for real-world applications, such as robotics or medical imaging.
Authors: Ondřej Mokrý, Peter Balušík, Pavel Rajmic
The paper focuses on inpainting missing parts of an audio signal spectrogram, i.e., estimating the lacking time-frequency coefficients. The autoregression-based Janssen algorithm, a state-of-the-art for the time-domain audio inpainting, is adapted for the time-frequency setting. This novel method, termed Janssen-TF, is compared with the deep-prior neural network approach using both objective metrics and a subjective listening test, proving Janssen-TF to be superior in all the considered measures.
Authors: Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu
We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scales and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: this https URL.
Authors: Guohui Cai, Ying Cai, Zeyu Zhang, Yuanzhouhan Cao, Lin Wu, Daji Ergu, Zhinbin Liao, Yang Zhao
Lung cancer remains one of the leading causes of morbidity and mortality worldwide, making early diagnosis critical for improving therapeutic outcomes and patient prognosis. Computer-aided diagnosis systems, which analyze computed tomography images, have proven effective in detecting and classifying pulmonary nodules, significantly enhancing the detection rate of early-stage lung cancer. Although traditional machine learning algorithms have been valuable, they exhibit limitations in handling complex sample data. The recent emergence of deep learning has revolutionized medical image analysis, driving substantial advancements in this field. This review focuses on recent progress in deep learning for pulmonary nodule detection, segmentation, and classification. Traditional machine learning methods, such as support vector machines and k-nearest neighbors, have shown limitations, paving the way for advanced approaches like Convolutional Neural Networks, Recurrent Neural Networks, and Generative Adversarial Networks. The integration of ensemble models and novel techniques is also discussed, emphasizing the latest developments in lung cancer diagnosis. Deep learning algorithms, combined with various analytical techniques, have markedly improved the accuracy and efficiency of pulmonary nodule analysis, surpassing traditional methods, particularly in nodule classification. Although challenges remain, continuous technological advancements are expected to further strengthen the role of deep learning in medical diagnostics, especially for early lung cancer detection and diagnosis. A comprehensive list of lung cancer detection models reviewed in this work is available at this https URL.
Authors: Zirui Chen, Zhaoyang Zhang, Chenyu Liu, Ziqing Xing
Research on leveraging big artificial intelligence model (BAIM) technology to drive the intelligent evolution of wireless networks is emerging. However, breakthroughs in generalization brought about by BAIM techniques mainly occur in natural language processing. There is a lack of a clear technical direction on how to efficiently apply BAIM techniques to wireless systems, which typically have many additional peculiarities. To this end, this paper reviews recent research on BAIM for wireless systems and assesses the current state of the field. It then analyzes and compares the differences between language intelligence and wireless intelligence on multiple levels, including scientific foundations, core usages, and technical details. It highlights the necessity and scientific significance of developing wireless native BAIM technologies, as well as specific issues that need to be considered for technical implementation. Finally, by synthesizing the evolutionary laws of language models with the particularities of wireless systems, this paper provides several instructive methodologies for developing wireless native BAIM.
Authors: Xinze Lyu, Sundar Aditya, Bruno Clerckx
A canonical use case of Integrated Sensing and Communications (ISAC) in multiple-input multiple-output (MIMO) systems involves a multi-antenna transmitter communicating with $K$ users and sensing targets in its vicinity. For this setup, precoder and multiple access designs are of utmost importance, as the limited transmit power budget must be efficiently directed towards the desired directions (users and targets) to maximize both communications and sensing performance. This problem has been widely investigated analytically under various design choices, in particular (a) whether or not a dedicated sensing signal is needed, and (b) for different MIMO multiple access techniques, such as Space Division Multiple Access (SDMA) and Rate-Splitting Multiple Access (RSMA). However, a conclusive answer on which design choice achieves the best ISAC performance, backed by experimental results, remains elusive. We address this gap by experimentally evaluating and comparing RSMA and SDMA for communicating with two users $(K = 2)$ and sensing (ranging) one target. Over three scenarios that are representative of \emph{vehicular} ISAC, covering different levels of inter-user interference and separation/integration between sensing and communications, we show that RSMA without a dedicated sensing signal achieves better ISAC performance -- i.e., higher sum throughput (up to $50\%$ peak throughput gain) for similar radar SNR (between $20$ to $24{\rm dB}$) -- than SDMA with a dedicated sensing signal. This first-ever experimental study of RSMA ISAC demonstrates the feasibility and superiority of RSMA for future multi-functional wireless systems.
Authors: Tierui Gong, Chau Yuen, Chong Meng Samson See, Mérouane Debbah, Lajos Hanzo
Quantum sensing technologies have experienced rapid progress since entering the `second quantum revolution'. Among various candidates, schemes relying on Rydberg atoms exhibit compelling advantages for detecting radio frequency signals. Based on this, Rydberg atomic quantum receivers (RAQRs) have emerged as a promising solution to classical wireless communication and sensing. To harness the advantages and exploit the potential of RAQRs in wireless sensing, we investigate the realization of direction of arrival (DOA) estimation by RAQRs. Specifically, we first conceive a Rydberg atomic quantum uniform linear array (RAQ-ULA) aided wireless receiver for multi-target DOA detection and propose the corresponding signal model of this sensing system. Our model reveals that the presence of the radio-frequency local oscillator in the RAQ-ULA creates sensor gain mismatches, which significantly degrade the DOA estimation obtained by the classical Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT). To solve this sensor gain mismatch problem, we propose the Rydberg atomic quantum ESPRIT (RAQ-ESPRIT) relying on our model. Lastly, we characterize our scheme through numerical simulations, where the results show that it reduces the estimation error of its classical counterpart by more than $400$-fold and $9000$-fold in the PSL and SQL, respectively.
Authors: Javier Borquez, Luke Raus, Yusuf Umut Ciftci, Somil Bansal
Designing controllers that are both safe and performant is inherently challenging. This co-optimization can be formulated as a constrained optimal control problem, where the cost function represents the performance criterion and safety is specified as a constraint. While sampling-based methods, such as Model Predictive Path Integral (MPPI) control, have shown great promise in tackling complex optimal control problems, they often struggle to enforce safety constraints. To address this limitation, we propose DualGuard-MPPI, a novel framework for solving safety-constrained optimal control problems. Our approach integrates Hamilton-Jacobi reachability analysis within the MPPI sampling process to ensure that all generated samples are provably safe for the system. On the one hand, this integration allows DualGuard-MPPI to enforce strict safety constraints; on the other, it facilitates a more effective exploration of the environment with the same number of samples, reducing the effective sampling variance and leading to better performance optimization. Through several simulations and hardware experiments, we demonstrate that the proposed approach achieves much higher performance compared to existing MPPI methods, without compromising safety.
Authors: Zefan Yang, Xuanang Xu, Jiajin Zhang, Ge Wang, Mannudeep K. Kalra, Pingkun Yan
Chest X-ray (CXR) is the most frequently ordered imaging test, supporting diverse clinical tasks from thoracic disease detection to postoperative monitoring. However, task-specific classification models are limited in scope, require costly labeled data, and lack generalizability to out-of-distribution datasets. To address these challenges, we introduce CheXFound, a self-supervised vision foundation model that learns robust CXR representations and generalizes effectively across a wide range of downstream tasks. We pretrain CheXFound on a curated CXR-1M dataset, comprising over one million unique CXRs from publicly available sources. For downstream adaptation, we propose a Global and Local Representations Integration (GLoRI) module, which incorporates disease-specific local features with global image features for enhanced performance in multilabel classification. Our experimental results show that CheXFound outperforms state-of-the-art models in classifying 40 disease findings across different prevalence levels on the CXR-LT 24 dataset and exhibits superior label efficiency on downstream tasks with limited training data. Additionally, CheXFound achieved significant improvements on new tasks with out-of-distribution datasets, including opportunistic cardiovascular disease risk estimation and mortality prediction. These results highlight CheXFound's strong generalization capabilities, enabling diverse adaptations with improved label efficiency. The project source code is publicly available at this https URL.
Authors: Siyuan Wang, Wenchuan Wu, Chenhui Lin, Qi Wang, Shuwei Xu, Binbin Chen
As a part of the integrated energy system (IES), gas pipeline networks can provide additional flexibility to power systems through coordinated optimal dispatch. An accurate pipeline network model is critical for the optimal operation and control of IESs. However, inaccuracies or unavailability of accurate pipeline parameters often introduce errors in the state-space models of such networks. This paper proposes a physics-informed recurrent network (PIRN) to identify the state-space model of gas pipelines. It fuses sparse measurement data with fluid-dynamic behavior expressed by partial differential equations. By embedding the physical state-space model within the recurrent network, parameter identification becomes an end-to-end PIRN training task. The model can be realized in PyTorch through modifications to a standard RNN backbone. Case studies demonstrate that our proposed PIRN can accurately estimate gas pipeline models from sparse terminal node measurements, providing robust performance and significantly higher parameter efficiency. Furthermore, the identified state-space model of the pipeline network can be seamlessly integrated into optimization frameworks.
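A minimal sketch of how a physics-informed recurrent cell of this kind can be realized in PyTorch, with a discretized physical update as the recurrence and the unknown parameters as the only trainable weights; the scalar dynamics, parameter names, and Euler step are illustrative assumptions, not the paper's pipeline model:

```python
import torch
import torch.nn as nn

class PIRNCell(nn.Module):
    """Recurrent cell whose state update is a discretized physical model
    x[k+1] = f(x[k], u[k]; theta); only theta is trained."""
    def __init__(self, dt: float = 0.01):
        super().__init__()
        self.dt = dt
        # Hypothetical unknown parameters, e.g. friction and wave-speed terms.
        self.theta = nn.Parameter(torch.randn(2))

    def forward(self, u_seq: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        a, b = self.theta
        x, outs = x0, []
        for u in u_seq.unbind(dim=1):           # unroll over time like an RNN
            x = x + self.dt * (a * x + b * u)   # forward-Euler physics step
            outs.append(x)
        return torch.stack(outs, dim=1)

# Parameter identification then reduces to ordinary training: fit theta so the
# rolled-out states match sparse terminal-node measurements under an MSE loss.
model = PIRNCell()
u = torch.randn(8, 50)                          # (batch, time) boundary inputs
x0 = torch.zeros(8)
loss = nn.functional.mse_loss(model(u, x0)[:, -1], torch.zeros(8))
loss.backward()                                 # gradients flow into theta
```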
Authors: Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki
In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task 1 of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles various recording conditions, from dinner parties to professional meetings, and from two to eight speakers. We perform diarization first, followed by speech enhancement, and then ASR, as in the challenge baseline. However, we introduced several key refinements. First, we derived a powerful speaker diarization pipeline relying on end-to-end speaker diarization with vector clustering (EEND-VC), multi-channel speaker counting using enhanced embeddings from EEND-VC, and target-speaker voice activity detection (TS-VAD). For speech enhancement, we introduced a novel microphone selection rule to better select the most relevant microphones among the distributed microphones and investigated improvements to beamforming. Finally, for ASR, we developed several models exploiting the Whisper and WavLM speech foundation models. We present the results we submitted to the challenge and updated results we obtained afterward. Our strongest system achieves a 63% relative macro tcpWER improvement over the baseline and outperforms the challenge best results on the NOTSOFAR-1 meeting evaluation data among geometry-independent systems.
Authors: Robin Strässer, Manuel Schaller, Julian Berberich, Karl Worthmann, Frank Allgöwer
We derive novel deterministic bounds on the approximation error of data-based bilinear surrogate models for unknown nonlinear systems. The surrogate models are constructed using kernel-based extended dynamic mode decomposition to approximate the Koopman operator in a reproducing kernel Hilbert space. Unlike previous methods that require restrictive assumptions on the invariance of the dictionary, our approach leverages kernel-based dictionaries that allow us to control the projection error via pointwise error bounds, overcoming a significant limitation of existing theoretical guarantees. The derived state- and input-dependent error bounds allow for direct integration into Koopman-based robust controller designs with closed-loop guarantees for the unknown nonlinear system. Numerical examples illustrate the effectiveness of the proposed framework.
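For orientation, a bare-bones kernel EDMD estimate of the Koopman operator reads as follows; this omits the paper's bilinear input dependence and its error bounds, and the RBF kernel, regularization, and toy dynamics are illustrative:

```python
import numpy as np

def kernel_edmd(X, Y, kernel, reg=1e-8):
    """X, Y: (n, d) snapshot pairs with y_i = F(x_i); returns the Koopman
    matrix acting on coefficient vectors in span{k(., x_i)}."""
    G = kernel(X, X)                     # Gram matrix of the kernel dictionary
    A = kernel(X, Y)                     # cross-Gram matrix with the successors
    return np.linalg.solve(G + reg * np.eye(len(X)), A)

def rbf(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = np.random.default_rng(0).standard_normal((200, 2))
Y = X + 0.01 * np.c_[X[:, 1], -np.sin(X[:, 0])]  # one Euler step of a pendulum
K = kernel_edmd(X, Y, rbf)                       # (200, 200) linear surrogate
```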
Authors: Claudio Fantasia, Luca Calatroni, Xavier Descombes, Rim Rekik
We consider a patch-based learning approach defined in terms of neural networks to estimate spatially adaptive regularisation parameter maps for image denoising with weighted Total Variation (TV) and test it in situations where the noise distribution is unknown. As an example, we consider situations where the noise could be either Gaussian or Poisson and perform preliminary model selection by a standard binary classification network. Then, we define a patch-based approach where at each image pixel an optimal weighting between TV regularisation and the corresponding data fidelity is learned in a supervised way, using reference natural image patches, upon optimisation of SSIM and in a sliding-window fashion. Extensive numerical results are reported for both noise models, showing significant improvement w.r.t. results obtained by means of optimal scalar regularisation.
Authors: Abhishek Dhyani, Amirreza Haqshenas Mojaveri, Chengqian Zhang, Dhanika Mahipala, Hoang Anh Tran, Yan-Yun Zhang, Zhongbi Luo, Vasso Reppa
This paper introduces AUTOBargeSim, a simulation toolbox for autonomous inland vessel guidance and control system design. AUTOBargeSim is developed in MATLAB and provides an easy-to-use introduction to various aspects of autonomous inland navigation, including mapping, modelling, control design, and collision avoidance, through examples and extensively documented code. Applying modular design principles in the simulator structure allows it to be easily modified according to the user's requirements. Furthermore, a GUI facilitates simple and quick execution. Key performance indices for evaluating the performance of the controller and collision avoidance method in confined spaces are also provided. The current version of AUTOBargeSim attempts to improve reproducibility in the design and simulation of marine systems while serving as a foundation for simulating and evaluating vessel behaviour considering operational, system, and environmental constraints.
Authors: Cheng Luo, Luping Xiang, Jie Hu, Kun Yang
Sensing-assisted communication schemes have recently garnered significant research attention. In this work, we design a dual-function reconfigurable intelligent surface (RIS), integrating both active and passive elements, referred to as the reconfigurable intelligent sensing surface (RISS), to enhance communication. By leveraging sensing results from the active elements, we propose communication enhancement and robust interference suppression schemes for both near-field and far-field models, implemented through the passive elements. These schemes remove the need for base station (BS) feedback for RISS control, simplifying the communication process by replacing traditional channel state information (CSI) feedback with real-time sensing from the active elements. The proposed schemes are theoretically analyzed and then validated using software-defined radio (SDR). Experimental results demonstrate the effectiveness of the sensing algorithms in real-world scenarios, such as direction of arrival (DOA) estimation and radio frequency (RF) identification recognition. Moreover, the RISS-assisted communication system shows strong performance in communication enhancement and interference suppression, particularly in near-field models.
Authors: Shuvashis Sarker, Shamim Rahim Refat, Faika Fairuj Preotee, Tanvir Rouf Shawon, Raihan Tanvir
Advanced diagnostic instruments are crucial for the accurate detection and treatment of lung diseases, which affect millions of individuals globally. This study examines the effectiveness of deep learning and transfer learning models using a hybrid dataset, created by merging four individual datasets from Bangladesh and global sources. The hybrid dataset significantly enhances model accuracy and generalizability, particularly in detecting COVID-19, pneumonia, lung opacity, and normal lung conditions from chest X-ray images. A range of models, including CNN, VGG16, VGG19, InceptionV3, Xception, ResNet50V2, InceptionResNetV2, MobileNetV2, and DenseNet121, were applied to both the individual and hybrid datasets. The results showed superior performance on the hybrid dataset, with VGG16, Xception, ResNet50V2, and DenseNet121 each achieving an accuracy of 99%. This consistent performance across the hybrid dataset highlights the robustness of these models in handling diverse data while maintaining high accuracy. To understand the models' implicit behavior, explainable AI techniques were employed to illuminate their black-box nature. Specifically, LIME was used to enhance the interpretability of model predictions, especially in cases of misclassification, contributing to the development of reliable and interpretable AI-driven solutions for medical imaging.
Authors: Buyi Yu, Wenyuan Tang
Planning and scheduling activities in the electrical power system, such as the commitment of reserve generation, often involve the statistical characterization of peak demand. Due to the stationarity assumption of classical extreme value analysis (EVA), existing approaches in the industry apply EVA to simulated annual peaks created by weather-dependent surrogate models using Monte-Carlo simulations on a per-scenario basis. In day-ahead scheduling, however, the daily peak demand depends on various factors besides temperature, so Monte-Carlo experiments become intractable, and state-of-the-art generalized additive model for location, scale and shape (GAMLSS)-based nonstationary EVA is often impractical due to convergence issues on high-dimensional covariates. This article explores this uncharted territory and proposes a novel nonstationary EVA estimator that predicts the probable peaks of high-resolution time intervals and their corresponding conditional probability densities based on calendar information and the weather conditions under which historical peaks are observed. Compared to GAMLSS, our method automatically discovers and robustly models complex relationships between the covariates and the peak demand density. We present a case study on the determination of day-ahead scheduling capacity and demonstrate that, compared to the industry approach, our approach results in a 38% reduction in the yearly total committed capacity while maintaining the given risk requirement.
Authors: Suhas BN, Andrew M. Sherrill, Jyoti Alaparthi, Dominik Mattioli, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah
Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements -- identifying their start and stop times -- directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases -- therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3) -- are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 313 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
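For readers unfamiliar with the adaptation recipe, a configuration along these lines (using the Hugging Face transformers and peft libraries) captures the general setup; the checkpoint id, target modules, and alpha/dropout values are assumptions, as the abstract only specifies LoRA rank 8 and 30-second windows:

```python
import torch
from transformers import Qwen2AudioForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Assumed checkpoint and attention-projection targets; the paper reports
# LoRA rank 8 with 30 s audio-transcript windows but not the full config.
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", torch_dtype=torch.bfloat16
)
lora_cfg = LoraConfig(
    r=8,                                   # rank reported as best performing
    lora_alpha=16,                         # assumed scaling
    lora_dropout=0.05,                     # assumed
    target_modules=["q_proj", "v_proj"],   # common choice for attention layers
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # only low-rank adapters are trainable
```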
Authors: Sebastián Rojas-Innocenti, Enrique Baeyens, Alejandro Martín-Crespo, Sergio Saludes-Rodil, Fernando Frechoso Escudero
This paper presents a scenario-based robust optimization framework for short-term energy scheduling in electricity-intensive industrial plants, explicitly addressing uncertainty in planning decisions. The model is formulated as a two-stage Mixed Integer Linear Program (MILP) and integrates a hybrid scenario generation method capable of representing uncertain inputs such as electricity prices, renewable generation, and internal demand. A convex objective function combining expected and worst-case operational costs allows for tunable risk aversion, enabling planners to balance economic performance and robustness. The resulting schedule ensures feasibility across all scenarios and supports coordinated use of industrial flexibility assets, including battery energy storage and shiftable production. To isolate the effects of market volatility, the framework is applied to a real-world cement manufacturing case study considering only day-ahead electricity price uncertainty, with all other inputs treated deterministically. Results show improved resilience to forecast deviations, reduced cost variability, and more consistent operations. The proposed method offers a scalable and risk-aware approach for industrial flexibility planning under uncertainty.
Authors: Yu Pan, Yuguang Yang, Yanni Hu, Jianhao Ye, Xiang Zhang, Hongbin Zhou, Lei Ma, Jianjun Zhao
Multilingual speech-to-speech translation (S2ST) aims to directly convert spoken utterances from multiple source languages into fluent and intelligible speech in a target language. Despite recent progress, several critical challenges persist: 1) achieving high-quality S2ST remains a significant obstacle; 2) most existing S2ST methods rely heavily on large-scale parallel speech corpora, which are difficult and resource-intensive to obtain. To tackle these challenges, we introduce S2ST-Omni, a novel, efficient, and scalable framework tailored for multilingual speech-to-speech translation. Specifically, we decompose S2ST into speech-to-text translation (S2TT) and text-to-speech synthesis (TTS). To enable high-quality S2TT while mitigating reliance on large-scale parallel speech corpora, we leverage powerful pretrained models: Whisper for robust audio understanding and Qwen 3.0 for advanced text comprehension. A lightweight speech adapter is introduced to bridge the modality gap between speech and text representations, facilitating effective utilization of pretrained multimodal knowledge. To ensure both translation accuracy and real-time responsiveness, we adopt a streaming speech generation model in the TTS stage, which generates the target speech in an autoregressive manner. Extensive experiments conducted on the CVSS benchmark demonstrate that S2ST-Omni consistently surpasses several state-of-the-art S2ST baselines in translation quality, highlighting its effectiveness and superiority.
Authors: Navid Hasanzadeh, Shahrokh Valaee
The newly established IEEE 802.11bf Task Group aims to amend the WLAN standard to support advanced sensing applications such as human activity recognition (HAR). Although studies have demonstrated the potential of sub-7 GHz Wi-Fi Channel State Information (CSI) for HAR, no method currently performs reliably in real-world scenarios. This work tackles the poor generalization of Wi-Fi-based HAR by introducing an innovative approach to extracting and utilizing movement-related representations, which makes it robust to noise and static environmental properties. This is achieved by transforming CSI signals into the delay profile space and decomposing them into various Doppler velocities, which serve as informative projections of a mobile point's velocity from different unknown random angles. To mitigate the impact of this randomness, MORIC is introduced as a novel time series classification model based on random convolutional kernels, designed to be invariant to the random order and repetition of input representations, thereby enabling robust Wi-Fi CSI-based activity classification. Experimental results on the collected dataset demonstrate that the proposed method outperforms state-of-the-art approaches in terms of generalization accuracy for hand motion recognition, particularly for challenging gestures. Furthermore, incorporating a small number of calibration samples leads to a significant improvement in accuracy, enhancing the practicality of the method for real-world deployment.
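A toy rendition of the order- and repetition-invariant random-kernel idea (not the actual MORIC model; kernel counts, sizes, and the max-pooling statistics are illustrative):

```python
import numpy as np

def invariant_random_kernel_features(representations, n_kernels=100, seed=0):
    """Convolve every input representation (e.g., one Doppler-velocity
    projection) with the same random kernels, then pool with max statistics.
    The max over the union of responses is unchanged by reordering or
    repeating the representations, giving the desired invariance."""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_kernels):
        k = rng.standard_normal(rng.integers(5, 12))
        responses = np.concatenate(
            [np.convolve(r, k, mode="valid") for r in representations]
        )
        feats += [responses.max(), np.abs(responses).max()]
    return np.asarray(feats)

reps = [np.random.randn(256) for _ in range(7)]   # projections at random angles
f1 = invariant_random_kernel_features(reps)
f2 = invariant_random_kernel_features(reps[::-1] + reps[:2])  # reordered, repeated
assert np.allclose(f1, f2)                        # features are unchanged
```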
Authors: Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, Daniel Povey
Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference speeds due to massive parameters. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-based zero-shot TTS model with a compact model size and fast inference speed. Key designs include: 1) a Zipformer-based flow-matching decoder to maintain adequate modeling capabilities under constrained size; 2) average upsampling-based initial speech-text alignment and a Zipformer-based text encoder to improve speech intelligibility; 3) a flow distillation method to reduce sampling steps and eliminate the inference overhead associated with classifier-free guidance. Experiments on 100k hours of multilingual data show that ZipVoice matches state-of-the-art models in speech quality, while being 3 times smaller and up to 30 times faster than a DiT-based flow-matching baseline. Code, model checkpoints, and demo samples are publicly available.
Authors: Menghua Xia, Reimund Bayerlein, Yanis Chemli, Xiaofeng Liu, Jinsong Ouyang, Georges El Fakhri, Ramsey D. Badawi, Quanzheng Li, Chi Liu
Artificial intelligence-generated content (AIGC) has shown remarkable performance in nuclear medicine imaging (NMI), offering cost-effective software solutions for tasks such as image enhancement, motion correction, and attenuation correction. However, these advancements come with the risk of hallucinations, generating realistic yet factually incorrect content. Hallucinations can misrepresent anatomical and functional information, compromising diagnostic accuracy and clinical trust. This paper presents a comprehensive perspective of hallucination-related challenges in AIGC for NMI, introducing the DREAM report, which covers recommendations for definition, representative examples, detection and evaluation metrics, underlying causes, and mitigation strategies. This position statement paper aims to initiate a common understanding for discussions and future research toward enhancing AIGC applications in NMI, thereby supporting their safe and effective deployment in clinical practice.
Authors: Yang Luo, Arunprakash Jayaprakash, Gaojie Chen, Chong Huang, Qu Luo, Pei Xiao
Satellite communications are crucial for the evolution beyond fifth-generation networks. However, the dynamic nature of satellite channels and their inherent impairments present significant challenges. In this paper, a novel post-compensation scheme that combines the complex-valued extreme learning machine with augmented hidden layer (CELMAH) architecture and widely linear processing (WLP) is developed to address these issues by exploiting signal impropriety in satellite communications. Although CELMAH shares structural similarities with WLP, it employs a different core algorithm and does not fully exploit the signal impropriety. By incorporating WLP principles, we derive a tailored formulation suited to the network structure and propose the CELM augmented by widely linear least squares (CELM-WLLS) for post-distortion. The proposed approach offers enhanced communication robustness and is highly effective for satellite communication scenarios characterized by dynamic channel conditions and non-linear impairments. CELM-WLLS is designed to improve signal recovery performance and outperform traditional methods such as least squares (LS) and minimum mean square error (MMSE). Compared to CELMAH, CELM-WLLS demonstrates an approximately 0.8 dB gain in BER performance and also achieves a two-thirds reduction in computational complexity, making it a more efficient solution.
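The widely linear least-squares ingredient can be sketched in a few lines: the regressor is augmented with its complex conjugate so that impropriety, i.e., correlation between a signal and its conjugate, is exploited. This is the generic WLLS estimator, not the full CELM-WLLS network:

```python
import numpy as np

def widely_linear_ls(x, d):
    """Fit d ~ x @ h + conj(x) @ g by least squares on the augmented
    regressor [x, conj(x)]; x: (n, p) complex, d: (n,) complex."""
    X_aug = np.hstack([x, np.conj(x)])
    w = np.linalg.lstsq(X_aug, d, rcond=None)[0]
    return lambda x_new: np.hstack([x_new, np.conj(x_new)]) @ w

# An improper toy signal: the target depends on both x and conj(x),
# so a strictly linear estimator cannot recover it exactly.
rng = np.random.default_rng(1)
x = rng.standard_normal((500, 3)) + 1j * rng.standard_normal((500, 3))
d = x @ np.array([1, 2j, -1]) + np.conj(x) @ np.array([0.5, 0, 1j])
predict = widely_linear_ls(x, d)
print(np.max(np.abs(predict(x) - d)))   # ~0 up to numerical precision
```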
Authors: Ziqin Chen, Yongqiang Wang
Distributed aggregative optimization underpins many cooperative optimization and multi-agent control systems, where each agent's objective function depends both on its local optimization variable and an aggregate of all agents' optimization variables. Existing distributed aggregative optimization approaches typically require access to accurate gradients of the objective functions, which, however, are often hard to obtain in real-world applications. For example, in machine learning, gradients are commonly contaminated by two main sources of noise: the randomness inherent in sampled data, and the additional variability introduced by mini-batch computations. In addition to the issue of relying on accurate gradients, existing distributed aggregative optimization approaches require agents to share explicit information, which could breach the privacy of participating agents. We propose an algorithm that can solve both problems with existing distributed aggregative optimization approaches: not only can the proposed algorithm guarantee mean-square convergence to an exact optimal solution when the gradients are subject to noise, it also simultaneously ensures rigorous differential privacy, with the cumulative privacy budget guaranteed to be finite even when the number of iterations tends to infinity. To the best of our knowledge, this is the first algorithm able to guarantee both accurate convergence and rigorous differential privacy in distributed aggregative optimization. Besides characterizing the convergence rates under nonconvex/convex/strongly convex conditions, we also rigorously quantify the cost of differential privacy in terms of convergence rates. Experimental results on personalized machine learning using benchmark datasets confirm the efficacy of the proposed algorithm.
Authors: Jonghun Kim, Gyeongdeok Jo, Sinyoung Ra, Hyunjin Park
Medical imaging data contain sensitive patient information requiring strong privacy protection. Many analytical setups require data to be sent to a server for inference purposes. Homomorphic encryption (HE) provides a solution by allowing computations to be performed on encrypted data without revealing the original information. However, HE inference is computationally expensive, particularly for large images (e.g., chest X-rays). In this study, we propose an HE inference framework for medical images that uses VQGAN to compress images into latent representations, thereby significantly reducing the computational burden while preserving image quality. We approximate the activation functions with lower-degree polynomials to balance accuracy and efficiency in compliance with HE requirements. We observed that a downsampling factor of eight for compression achieved an optimal balance between performance and computational cost. We further adapted the squeeze-and-excitation module, which is known to improve traditional CNNs, to enhance the HE framework. Our method was tested on two chest X-ray datasets for multi-label classification tasks using vanilla CNN backbones. Although HE inference remains relatively slow and introduces minor performance differences compared with unencrypted inference, our approach shows strong potential for practical use in medical imaging.
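One concrete piece of such a pipeline is the polynomial activation approximation: HE schemes such as CKKS evaluate only additions and multiplications, so each activation is replaced by a low-degree polynomial fit on a bounded interval. The degree, interval, and choice of ReLU below are illustrative, not the paper's reported settings:

```python
import numpy as np

# Least-squares fit of a degree-3 polynomial to ReLU on [-4, 4]; under HE,
# encrypted inference evaluates this polynomial in place of the activation.
x = np.linspace(-4.0, 4.0, 2001)
relu = np.maximum(x, 0.0)
coeffs = np.polynomial.polynomial.polyfit(x, relu, deg=3)

def relu_poly(t):
    # Horner-style evaluation of the fitted polynomial (HE-friendly ops only).
    return np.polynomial.polynomial.polyval(t, coeffs)

print(np.max(np.abs(relu_poly(x) - relu)))  # worst-case error on the interval
```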
Authors: Hamidreza Erfanijazi, Luis A. Camuñas-Mesa, Elisa Vianello, Teresa Serrano-Gotarredona, Bernabé Linares-Barranco
For neuromorphic engineering to emulate the human brain, improving memory density with low power consumption is an indispensable but challenging goal. In this regard, emerging RRAMs have attracted considerable interest for their unique qualities like low power consumption, high integration potential, durability, and CMOS compatibility. Using RRAMs to imitate the more analog storage behavior of brain synapses is also a promising strategy for further improving memory density and power efficiency. However, RRAM devices display strong stochastic behavior, together with relaxation effects, making it more challenging to precisely control their multi-level storage capability. To address this, researchers have reported different multi-level programming strategies, mostly involving the precise control of analog parameters like compliance current during write operations and/or programming voltage amplitudes. Here, we present a new fully digital relaxation-aware method for tuning the conductance of analog RRAMs. The method is based on modulating digital pulse widths during erase operations while keeping other parameters fixed, and therefore requires no precise alterations to analog parameters like compliance currents or programming voltage amplitudes. Experimental results, with and without relaxation effect awareness, on a 64-cell 1T1R HfOx RRAM memory array, fabricated in 130 nm CMOS technology, indicate that it is possible to obtain 2-bit-per-cell multi-value storage at the array level, verified 1000 seconds after programming.
Authors: Philipp L. Kinon, Tobias Thoma, Peter Betsch, Paul Kotyczka
We provide a fully nonlinear port-Hamiltonian formulation for discrete elastodynamical systems as well as a structure-preserving time discretization. The governing equations are obtained in a variational manner and represent index-1 differential algebraic equations. Performing an index reduction one obtains the port-Hamiltonian state space model, which features the nonlinear strains as an independent state next to position and velocity. Moreover, hyperelastic material behavior is captured in terms of a nonlinear stored energy function. The model exhibits passivity and losslessness and has an underlying symmetry yielding the conservation of angular momentum. We perform temporal discretization using the midpoint discrete gradient, such that the beneficial properties are inherited by the developed time stepping scheme in a discrete sense. The numerical results obtained in a representative example are demonstrated to validate the findings.
Authors: Kailas Dayanandan, Nikhil Kumar, Anand Sinha, Brejesh Lall
The dual thinking framework considers fast, intuitive, and slower logical processing. The perception of dual thinking in vision requires images where inferences from intuitive and logical processing differ, and the latter is under-explored in current studies. We introduce a novel adversarial dataset to provide evidence for the dual thinking framework in human vision, which also facilitates the study of the qualitative behavior of deep learning models. Our psychophysical studies show the presence of multiple inferences in rapid succession, and analysis of errors shows that the early stopping of visual processing can result in missing relevant information. MLLMs (Multi-modal Large Language Models) and VLMs (Vision Language Models) have made significant progress in correcting errors in intuitive processing in human vision and showed enhanced performance on images requiring logical processing. However, their improvements in logical processing have not kept pace with their advancements in intuitive processing. In contrast, segmentation models exhibit errors similar to those seen in intuitive human processing and lack understanding of sub-structures, as indicated by errors related to sub-components in identified instances. As AI (Artificial Intelligence)-based systems find increasing applications in safety-critical domains like autonomous driving, the integration of logical processing capabilities becomes essential. This not only enhances performance but also addresses the limitations of scaling-based approaches while ensuring robustness and reliability in real-world environments.
Authors: Zhiyu Shao, Qiong Wu, Pingyi Fan, Nan Cheng, Qiang Fan, Jiangzhou Wang
This letter proposes a semantic-aware resource allocation (SARA) framework with a flexible duty cycle (DC) coexistence mechanism (SARADC) for 5G-V2X Heterogeneous Networks (HetNets) based on deep reinforcement learning (DRL) with proximal policy optimization (PPO). Specifically, we investigate V2X networks within a two-tiered HetNets structure. In response to the needs of high-speed vehicular networking in urban environments, we design a semantic communication system and introduce two resource allocation metrics: the high-speed semantic transmission rate (HSR) and the high-speed semantic spectrum efficiency (HSSE). Our main goal is to maximize HSSE. Additionally, we address the coexistence of vehicular users and WiFi users in 5G New Radio Unlicensed (NR-U) networks. To tackle this complex challenge, we propose a novel approach that jointly optimizes the flexible DC coexistence mechanism and the allocation of resources and base stations (BSs). Unlike traditional bit transmission methods, our approach integrates the semantic communication paradigm into the communication system. Experimental results demonstrate that our proposed solution outperforms traditional bit transmission methods with a traditional DC coexistence mechanism in terms of HSSE and semantic throughput (ST) for both vehicular and WiFi users.
Authors: Kangwei Qi, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen, Jiangzhou Wang, Khaled B. Letaief
Reconfigurable Intelligent Surface (RIS) is a pivotal technology in communication, offering an alternative path that significantly enhances the link quality in wireless communication environments. In this paper, we propose a RIS-assisted internet of vehicles (IoV) network, considering the vehicle-to-everything (V2X) communication method. In addition, in order to improve the timeliness of vehicle-to-infrastructure (V2I) links and the stability of vehicle-to-vehicle (V2V) links, we introduce the age of information (AoI) model and the payload transmission probability model. Therefore, with the objective of minimizing the AoI of V2I links and prioritizing transmission of the V2V links' payload, we construct this optimization problem as a Markov decision process (MDP), in which the BS serves as an agent to allocate resources and control phase-shifts for the vehicles using the soft actor-critic (SAC) algorithm, which converges gradually and maintains high stability. An AoI-aware joint vehicular resource allocation and RIS phase-shift control scheme based on the SAC algorithm is proposed, and simulation results show that its convergence speed, cumulative reward, AoI performance, and payload transmission probability outperform those of proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin delayed deep deterministic policy gradient (TD3), and stochastic algorithms.
Authors: Kangwei Qi, Qiong Wu, Pingyi Fan, Nan Cheng, Qiang Fan, Jiangzhou Wang
Vehicular edge computing (VEC) is an emerging technology that enables vehicles to perform high-intensity tasks by executing them locally or offloading them to nearby edge devices. However, obstacles such as buildings may degrade communications and incur communication interruptions, so a vehicle may not meet the requirements for task offloading. A reconfigurable intelligent surface (RIS) is introduced to support vehicle communication and provide an alternative communication path. The system performance can be improved by flexibly adjusting the phase-shift of the RIS. For a RIS-assisted VEC system where tasks arrive randomly, we design a control scheme that considers offloading power, local power allocation, and phase-shift optimization. To solve this non-convex problem, we propose a new deep reinforcement learning (DRL) framework that employs a modified multi-agent deep deterministic policy gradient (MADDPG) approach to optimize the power allocation for vehicle users (VUs) and a block coordinate descent (BCD) algorithm to optimize the phase-shift of the RIS. Simulation results show that our proposed scheme outperforms the centralized deep deterministic policy gradient (DDPG) scheme and a random scheme.
Authors: Shulin Song, Zheng Zhang, Qiong Wu, Qiang Fan, Pingyi Fan
Autonomous driving may be the most important application scenario of the next generation of wireless networks, so the development of wireless access technologies enabling reliable and low-latency vehicle communication has become crucial. To address this, 3GPP has developed Vehicle-to-Everything (V2X) specifications based on 5G New Radio (NR) technology, where Mode 2 Side-Link (SL) communication resembles Mode 4 in LTE-V2X, allowing direct communication between vehicles. This supplements SL communication in LTE-V2X and represents the latest advancement in cellular V2X (C-V2X), with the improved performance of NR-V2X. However, in NR-V2X Mode 2, resource collisions still occur and thus degrade the age of information (AoI). Therefore, an interference cancellation method is employed to mitigate this impact by combining NR-V2X with Non-Orthogonal Multiple Access (NOMA) technology. In NR-V2X, when vehicles select a smaller resource reservation interval (RRI), the resulting higher-frequency transmissions take more energy to reduce AoI. Hence, it is important to jointly consider AoI and communication energy consumption in NR-V2X communication. We therefore formulate such an optimization problem and employ a Deep Reinforcement Learning (DRL) algorithm to compute the optimal transmission RRI and transmission power for each transmitting vehicle, reducing the energy consumption of each transmitting vehicle and the AoI of each receiving vehicle. Extensive simulations demonstrate the performance of our proposed algorithm.
Authors: William A. Clark
A study of the dynamics and control of linear and affine hybrid systems subjected to either temporally- or spatially-triggered resets is presented. Hybrid trajectories are capable of degeneracies not found in continuous-time systems, namely beating, blocking, and Zeno. These pathologies are commonly avoided by enforcing a lower bound on the time between events. While this constraint is straightforward to implement for temporally-triggered resets, it is impossible to do so for spatially-triggered systems. In particular, linear spatially-triggered hybrid systems always possess trajectories that are beating and blocking, while affine systems may also include Zeno trajectories. The hybrid Pontryagin maximum principle is studied in the context of affine hybrid systems. The existence and uniqueness of the induced co-state jump conditions is studied, which introduces the notion of strongly and weakly actuated resets. Finally, optimal control in the context of beating and Zeno is discussed. This work concludes with numerical examples.
Authors: Sadia Nowrin, Keith Vertanen
Conversational systems rely heavily on speech recognition to interpret and respond to user commands and queries. Despite progress on speech recognition accuracy, errors may still sometimes occur and can significantly affect the end-user utility of such systems. While visual feedback can help detect errors, it may not always be practical, especially for people who are blind or low-vision. In this study, we investigate ways to improve error detection by manipulating the audio output of the transcribed text based on the recognizer's confidence level in its result. Our findings show that selectively slowing down the audio when the recognizer exhibited uncertainty led to a 12% relative increase in participants' ability to detect errors compared to uniformly slowing the audio. It also reduced the time it took participants to listen to the recognition result and decide if there was an error by 11%.
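The manipulation itself is simple to sketch: assign each word a playback rate as a function of the recognizer's confidence. The threshold and slowdown factor below are illustrative, not the study's exact values:

```python
def word_playback_rates(confidences, threshold=0.6, slow=0.75):
    """Selectively slow the spoken readback only over low-confidence words,
    rather than slowing the whole utterance uniformly."""
    return [slow if c < threshold else 1.0 for c in confidences]

# e.g. a TTS engine would read the second word at 0.75x speed, the rest at 1x
print(word_playback_rates([0.97, 0.41, 0.88]))  # [1.0, 0.75, 1.0]
```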
Authors: Chen Qian, Tangyou Liu, Liao Wu
Follow-the-leader (FTL) motion is essential for continuum robots operating in fragile and confined environments. It allows the robot to exert minimal force on its surroundings, reducing the risk of damage. This paper presents a novel design of a snake-like robot capable of achieving FTL motion by integrating fiber jamming modules (FJMs). The proposed robot can dynamically adjust its stiffness during propagation and interaction with the environment. An algorithm is developed to independently control the tendon and FJM insertion movements, allowing the robot to maintain its shape while minimizing the forces exerted on surrounding structures. To validate the proposed design, comparative tests were conducted between a traditional tendon-driven robot and the novel design under different configurations. The results demonstrate that our design relies significantly less on contact with the surroundings to maintain its shape. This highlights its potential for safer and more effective operations in delicate environments, such as minimally invasive surgery (MIS) or industrial in-situ inspection.
Authors: Benoit Brummer, Christophe De Vleeschouwer
This paper introduces the Raw Natural Image Noise Dataset (RawNIND), a diverse collection of paired raw images designed to support the development of denoising models that generalize across sensors, image development workflows, and styles. Two denoising methods are proposed: one operates directly on raw Bayer data, leveraging computational efficiency, while the other processes linear RGB images for improved generalization to different sensors, with both preserving flexibility for subsequent development. Both methods outperform traditional approaches which rely on developed images. Additionally, the integration of denoising and compression at the raw data level significantly enhances rate-distortion performance and computational efficiency. These findings suggest a paradigm shift toward raw data workflows for efficient and flexible image processing.
Authors: Mehdi Sattari, Deniz Gündüz, Tommy Svensson
Efficient channel state information (CSI) compression is essential in frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) systems due to the significant feedback overhead. Recently, deep learning-based compression techniques have demonstrated superior performance across various data types, including CSI. However, these methods often suffer from performance degradation when the data distribution shifts, primarily due to limited generalization capabilities. To address this challenge, we propose an online model fine-tuning approach for CSI feedback in massive MIMO systems. We consider full-model fine-tuning, where both the encoder and decoder are jointly updated using recent CSI samples. A key challenge in this setup is the transmission of updated decoder parameters, which introduces additional feedback overhead. To mitigate this bottleneck, we incorporate the bit-rate of model updates into the fine-tuning objective and entropy code the updates jointly with the compressed CSI. To reduce the bit-rate, we design an efficient prior distribution that encourages the network to update only the most significant weights, thereby minimizing the overall model update cost. Our results show that full-model fine-tuning significantly enhances the rate-distortion (RD) performance of neural CSI compression despite the additional communication cost of model updates. Moreover, we investigate the impact of update frequency in dynamic wireless environments and identify an optimal fine-tuning interval that achieves the best RD trade-off.
Authors: Yunsik Kim, Yonghun Song, Yoonyoung Chung
In high-noise environments such as factories, subways, and busy streets, capturing clear speech is challenging. Throat microphones can offer a solution because of their inherent noise-suppression capabilities; however, the passage of sound waves through skin and tissue attenuates high-frequency information, reducing speech clarity. Recent deep learning approaches have shown promise in enhancing throat microphone recordings, but further progress is constrained by the lack of a standard dataset. Here, we introduce the Throat and Acoustic Paired Speech (TAPS) dataset, a collection of paired utterances recorded from 60 native Korean speakers using throat and acoustic microphones. Furthermore, an optimal alignment approach was developed and applied to address the inherent signal mismatch between the two microphones. We tested three baseline deep learning models on the TAPS dataset and found mapping-based approaches to be superior for improving speech quality and restoring content. These findings demonstrate the TAPS dataset's utility for speech enhancement tasks and support its potential as a standard resource for advancing research in throat microphone-based applications.
Authors: Piyushi Manupriya, Himanshu, SakethaNath Jagarlapudi, Ganesh Ghalme
We investigate the problem of maximizing social welfare while ensuring fairness in a multi-agent multi-armed bandit (MA-MAB) setting. In this problem, a centralized decision-maker takes actions over time, generating random rewards for various agents. Our goal is to maximize the sum of expected cumulative rewards, a.k.a. social welfare, while ensuring that each agent receives an expected reward that is at least a constant fraction of the maximum possible expected reward. Our proposed algorithm, RewardFairUCB, leverages the Upper Confidence Bound (UCB) technique to achieve sublinear regret bounds for both fairness and social welfare. The fairness regret measures the positive difference between the minimum reward guarantee and the expected reward of a given policy, whereas the social welfare regret measures the difference between the social welfare of the optimal fair policy and that of the given policy. We show that the RewardFairUCB algorithm achieves instance-independent social welfare regret guarantees of $\tilde{O}(T^{1/2})$ and a fairness regret upper bound of $\tilde{O}(T^{3/4})$. We also give the lower bound of $\Omega(\sqrt{T})$ for both social welfare and fairness regret. We evaluate RewardFairUCB's performance against various baseline and heuristic algorithms using simulated data and real-world data, highlighting trade-offs between fairness and social welfare regrets.
Authors: Qi Mao, Haobo Hu, Yujie He, Difei Gao, Haokun Chen, Libiao Jin
Affective Image Manipulation (AIM) aims to alter visual elements within an image to evoke specific emotional responses from viewers. However, existing AIM approaches rely on rigid \emph{one-to-one} mappings between emotions and visual cues, making them ill-suited for the inherently subjective and diverse ways in which humans perceive and express emotions. To address this, we introduce a novel task setting termed \emph{Diverse AIM (D-AIM)}, aiming to generate multiple visually distinct yet emotionally consistent image edits from a single source image and target emotion. We propose \emph{EmoAgent}, the first multi-agent framework tailored specifically for D-AIM. EmoAgent explicitly decomposes the manipulation process into three specialized phases executed by collaborative agents: a Planning Agent that generates diverse emotional editing strategies, an Editing Agent that precisely executes these strategies, and a Critic Agent that iteratively refines the results to ensure emotional accuracy. This collaborative design empowers EmoAgent to model \emph{one-to-many} emotion-to-visual mappings, enabling semantically diverse and emotionally faithful edits. Extensive quantitative and qualitative evaluations demonstrate that EmoAgent substantially outperforms state-of-the-art approaches in both emotional fidelity and semantic diversity, effectively generating multiple distinct visual edits that convey the same target emotion.
Authors: Zubair Shaban, Nazreen Shah, Ranjitha Prasad
In 6G wireless networks, Artificial Intelligence (AI)-driven applications demand the adoption of Federated Learning (FL) to enable efficient and privacy-preserving model training across distributed devices. Over-The-Air Federated Learning (OTA-FL) exploits the superposition property of multiple access channels, allowing edge users in 6G networks to efficiently share spectral resources and perform low-latency global model aggregation. However, these advantages come with challenges, as traditional OTA-FL techniques suffer due to the joint effects of Additive White Gaussian Noise (AWGN) at the server, fading, and both data and system heterogeneity at the participating edge devices. In this work, we propose the novel Noise Resilient Over-the-Air Federated Learning (NoROTA-FL) framework to jointly tackle these challenges in federated wireless networks. In NoROTA-FL, the local optimization problems find controlled inexact solutions, which manifests as an additional proximal constraint at the clients. This approach provides robustness against straggler-induced partial work, heterogeneity, noise, and fading. From a theoretical perspective, we leverage the zeroth- and first-order inexactness and establish convergence guarantees for non-convex optimization problems in the presence of heterogeneous data and varying system capabilities. Experimentally, we validate NoROTA-FL on real-world datasets, including FEMNIST, CIFAR10, and CIFAR100, demonstrating its robustness in noisy and heterogeneous environments. Compared to state-of-the-art baselines such as COTAF and FedProx, NoROTA-FL achieves significantly more stable convergence and higher accuracy, particularly in the presence of stragglers.
Authors: Xiaojun Yuan, Haoming Ma, Yinuo Huang, Zhoufan Hua, Yong Zuo, Zhi Ding
Semantic communications leverage artificial intelligence (AI) technologies to extract semantic information for efficient data delivery, thereby significantly reducing communication cost. With the evolution towards artificial general intelligence (AGI), the increasing demands for AGI services pose new challenges to semantic communications. In this context, an AGI application is typically defined on a general-sense task, covering a broad, even unforeseen, set of objectives, and is driven by the need for a human-friendly interface in forms (e.g., videos, images, or text) easily understood by human users. In response, we introduce an AGI-driven communication paradigm for supporting AGI applications, called generative semantic communication (GSC). We first describe the basic concept of GSC and its difference from existing semantic communications, and then introduce a general framework for GSC based on advanced AI technologies, including foundation models and generative models. Two case studies are presented to verify the advantages of GSC. Finally, open challenges and new research directions are discussed to stimulate this line of research and pave the way for practical applications.
Authors: Nahshon Mokua Obiri, Kristof Van Laerhoven
Modeling path loss in indoor LoRaWAN technology deployments is inherently challenging due to structural obstructions, occupant density and activities, and fluctuating environmental conditions. This study proposes a two-stage approach to capture and analyze these complexities using an extensive dataset of 1,328,334 field measurements collected over six months in a single-floor office at the University of Siegen's Hoelderlinstrasse Campus, Germany. First, we implement a multiple linear regression model that includes traditional propagation metrics (distance, structural walls) and an extension with proposed environmental variables (relative humidity, temperature, carbon dioxide, particulate matter, and barometric pressure). Using analysis of variance, we demonstrate that adding these environmental factors can reduce unexplained variance by 42.32 percent. Second, we examine residual distributions by fitting five candidate probability distributions: Normal, Skew-Normal, Cauchy, Student's t, and Gaussian Mixture Models (GMMs) with 2 to 5 components. Our results show that a four-component Gaussian Mixture Model captures the residual heterogeneity of indoor signal propagation most accurately, significantly outperforming single-distribution approaches. Given the push toward ultra-reliable, context-aware communications in 6G networks, our analysis shows that environment-aware modeling can substantially improve LoRaWAN network design in dynamic indoor IoT deployments.
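The two-stage recipe maps directly onto standard tooling; a compact sketch on synthetic data (the features and t-distributed noise are stand-ins for the real measurements):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

# Stage 1: linear path-loss regression with environmental covariates.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # stand-ins: log-distance, walls, humidity
y = 40 + 20 * X[:, 0] + 3 * X[:, 1] + rng.standard_t(df=4, size=1000)
residuals = (y - LinearRegression().fit(X, y).predict(X)).reshape(-1, 1)

# Stage 2: fit candidate residual mixtures and compare by an information
# criterion; the study found a four-component Gaussian mixture to fit best.
for k in range(2, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(residuals)
    print(k, gmm.bic(residuals))
```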
Authors: Bowen Li, Zekun Chen, Xuefei Chen, Luhao Zhang, Shili Liang
A wireless wearable Electrical Impedance Tomography (EIT) system has been developed utilizing the AD5933 chip to achieve real-time imaging of lung respiration. The system employs a voltage excitation method tailored to human impedance characteristics, injecting current by applying a known voltage and measuring the resulting current through the body. Additionally, specific measures have been implemented to effectively suppress signal oscillations and leakage currents caused by parasitic capacitances. To enhance data acquisition speed, the system employs five parallel AD5933 units, with multiple techniques implemented to ensure high synchronization during simultaneous measurements. Performance testing shows that the system achieves a signal-to-noise ratio greater than 50 dB, a relative standard deviation below 0.3%, and a reciprocity error under 0.8%. Imaging experiments using a water tank phantom, human lungs during breathing, and a resting human calf further demonstrate that this portable EIT system can accurately measure biological tissues with high precision and low cost.
Authors: Mingyao Cui, Qunsong Zeng, Zhanwei Wang, Kaibin Huang
Harnessing multi-level electron transitions, Rydberg Atomic REceivers (RAREs) can detect wireless signals across a wide range of frequency bands, from Megahertz to Terahertz, enabling multi-band communications and sensing (CommunSense). Current research on multi-band RAREs primarily focuses on experimental demonstrations, lacking a tractable model to mathematically characterize their mechanisms. This issue leaves the multi-band RARE as a black box, posing challenges in its practical CommunSense applications. To fill this gap, this paper investigates the underlying mechanism of multi-band RAREs and explores their optimal performance. For the first time, the closed-form expression of the transfer function of a multi-band RARE is derived by solving the quantum response of Rydberg atoms excited by multi-band signals. The function reveals that a multi-band RARE simultaneously serves as both a multi-band atomic mixer for down-converting multi-band signals and a multi-band atomic amplifier that reflects its sensitivity to each band. Further analysis of the atomic amplifier unveils that the gain factor at each frequency band can be decoupled into a global gain term and a Rabi attention term. The former determines the overall sensitivity of a RARE to all frequency bands of wireless signals. The latter influences the allocation of the overall sensitivity to each frequency band, representing a unique attention mechanism of multi-band RAREs. The optimal design of the global gain is provided to maximize the overall sensitivity of multi-band RAREs. Subsequently, the optimal Rabi attentions are also derived to maximize the practical multi-band CommunSense performance. Numerical results confirm the effectiveness of the derived transfer function and the superiority of multi-band RAREs.
Authors: Blake McGrane-Corrigan, Rafael de Andrade Moral, Oliver Mason
We consider the problem of robust diffusive stability (RDS) for a pair of coupled stable discrete-time positive linear time-invariant (LTI) systems. We first show that the existence of a common diagonal Lyapunov function is sufficient for RDS and highlight how this condition differs from recent results using linear copositive Lyapunov functions. We also present an extension of these results, showing that the weaker condition of \emph{joint} linear copositive function existence is also sufficient for RDS. Finally, we present two results on RDS for extended Leslie matrices arising in population dynamics.
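A common way to check the common diagonal Lyapunov function condition numerically is a small semidefinite feasibility problem; the sketch below does this for two toy Schur-stable positive matrices (the matrices and the tolerance eps are assumptions, not the paper's examples).

```python
# Hedged sketch: search for a common diagonal Lyapunov function P = diag(p)
# satisfying A_i^T P A_i - P < 0 for two Schur-stable positive matrices.
import numpy as np
import cvxpy as cp

A1 = np.array([[0.5, 0.10], [0.2, 0.4]])
A2 = np.array([[0.6, 0.05], [0.1, 0.5]])
n = 2

p = cp.Variable(n)            # diagonal entries of P
P = cp.diag(p)
eps = 1e-6
constraints = [p >= eps]
for A in (A1, A2):
    # Discrete-time Lyapunov inequality, strictified with eps*I.
    constraints.append(P - A.T @ P @ A >> eps * np.eye(n))

prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve(solver=cp.SCS)
print("status:", prob.status, "diag(P):", p.value)
```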
Authors: Kaiyuan Chen, Zhengjie Hu, Shaolin Zhang, Yuanqing Xia, Wannian Liang, Shuo Wang
The rapid detection of abnormal body temperatures in urban populations is essential for managing public health risks, especially during outbreaks of infectious diseases. Multi-drone thermal screening systems offer promising solutions for fast, large-scale, and non-intrusive human temperature monitoring. However, trajectory planning for multiple drones in complex urban environments poses significant challenges, including collision avoidance, coverage efficiency, and constrained flight environments. In this study, we propose an enhanced trust region sequential convex optimization (TR-SCO) algorithm for optimal trajectory planning of multiple drones performing thermal screening tasks. Our improved algorithm integrates a refined convex optimization formulation within a trust region framework, effectively balancing trajectory smoothness, obstacle avoidance, altitude constraints, and maximum screening coverage. Simulation results demonstrate that our approach significantly improves trajectory optimality and computational efficiency compared to conventional convex optimization methods. This research provides critical insights and practical contributions toward deploying efficient multi-drone systems for real-time thermal screening in urban areas. For readers interested in our research, we release our source code at this https URL.
Authors: Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky
Accurate, low-latency endpointing is crucial for effective spoken dialogue systems. While traditional endpointers often rely on spectrum-based audio features, this work proposes real-time speech endpointing for multi-turn dialogues using streaming, low-bitrate Neural Audio Codec (NAC) features, building upon recent advancements in neural audio codecs. To further reduce cutoff errors, we introduce a novel label delay training scheme. At a fixed median latency of 160 ms, our combined NAC and label delay approach achieves significant relative cutoff error reductions: 42.7% for a single-stream endpointer and 37.5% for a two-stream configuration, compared to baseline methods. Finally, we demonstrate efficient integration with a codec-based pretrained speech large language model, improving its median response time by 1200 ms and reducing its cutoff error by 35%.
Authors: Yanhong Luo, Wenchao Meng, Xi Zhu, Andreas Elombo, Hu Rong, Bing Xie, Tianwen Zhang
With the increasing prevalence of distributed generators, islanded operation based on distributed generation is considered a vital means to enhance the reliability and resilience of smart grids. This paper investigates the main factors in islanding partition of smart grids and establishes a mathematical model for islanding division. A method to determine the maximum power supply range of distributed energy resources (DERs) based on the reachability matrix and power circle algorithm is proposed to improve computational efficiency. A dynamic programming method based on breadth-first search (BFS) is used to solve the islanding partition scheme, and a region correction method is applied to modify the maximum power supply area by considering controllable loads and prioritizing critical load restoration, thereby enhancing system resilience. Finally, simulation results verify the effectiveness of the proposed algorithm in improving smart grid resilience.
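The maximum-supply-range step lends itself to a simple graph search; below is a minimal, hedged sketch of a BFS expansion that grows an island around a DER while respecting its capacity. The network, loads, and greedy acceptance rule are toy stand-ins for the paper's reachability-matrix and power-circle machinery.

```python
# Hedged sketch: BFS-based expansion of a DER's supply region on a feeder graph.
from collections import deque

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}   # adjacency list (toy feeder)
load = {0: 0.0, 1: 1.5, 2: 2.0, 3: 1.0}        # bus loads (MW); bus 0 hosts the DER
der_capacity = 4.0                             # MW

def supply_region(root, capacity):
    """Greedy BFS: add buses while the total picked-up load fits the DER capacity."""
    island, served = {root}, load[root]
    queue = deque([root])
    while queue:
        bus = queue.popleft()
        for nb in adj[bus]:
            if nb not in island and served + load[nb] <= capacity:
                island.add(nb)
                served += load[nb]
                queue.append(nb)
    return island, served

print(supply_region(0, der_capacity))   # e.g. ({0, 1, 2}, 3.5) for this toy case
```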
Authors: E. D. Gomez Anccas, C. A. Hans, D. Schulz
Modern low-carbon power systems come with many challenges, such as increased inverter penetration and increased uncertainty from renewable sources and loads. In this context, the microgrid concept is a promising approach, which is based on a segmentation of the grid into independent smaller cells that can run either in grid-connected or standalone mode. In microgrids, droop control is widely used for primary control. It enables proportional power sharing, depending on the droop gains. Operation control schemes considering droop control often assume fixed droop gains. However, using adaptive droop gains for grid-forming units allows shaping power sharing in the presence of fluctuations, enhancing flexibility while maintaining a safe microgrid operation, particularly under uncertainty. This work introduces a bilinear formulation for microgrid operation control that finds optimal power setpoints and droop gains on a timescale of minutes by solving a finite horizon optimization problem. In detail, a robust min-max model predictive control scheme is designed for a standalone microgrid, comprising a fuel cell, a photovoltaic system and an energy storage system. Closed-loop simulations are performed with and without variable droop gains. The results show an increase in renewable utilization of up to 7.5 % while reducing the power output of the fuel cell by 6 % when allowing variable droop gains.
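For context, the standard frequency droop law that the adaptive-gain idea builds on reads, in common notation (assumed here),

\[ \omega_i = \omega^{*} - m_{p,i}\,\bigl(P_i - P_i^{*}\bigr), \]

so units sharing a common steady-state frequency split power changes in inverse proportion to their droop gains, \(\Delta P_i / \Delta P_j = m_{p,j} / m_{p,i}\). Treating the gains \(m_{p,i}\) as decision variables alongside the setpoints \(P_i^{*}\) is what introduces the bilinear terms into the operation control problem.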
Authors: Ruonan Lia, Chang Wena, Mingyu Yan, Congcong Wu, Ahmed Lotfy Elrefai, Xiaotong Zhang, Sahban Wael Saeed Alnaser
This study focuses on the novel municipal-scale rural integrated energy system (RIES), which encompasses energy supply and application. By constructing a seven-dimensional evaluation system including energy efficiency, energy supply, low-carbon sustainability, environmental impact, energy economy, social benefits, and integrated energy system development, this research combines the improved analytic hierarchy process (IAHP) and the entropy weight method (EWM) via a sum-of-squared-deviations combination to balance expert experience and data objectivity. Furthermore, the cloud model is introduced to handle the fuzziness and randomness in the evaluation. This method can quantify the differences in system performance before and after the planning implementation. The results indicate that after planning, the comprehensive score has increased from 83.12 to 87.55, and the entropy value has decreased from 6.931 to 5.336, indicating enhanced system stability. The hyper-entropy has dropped from 3.08 to 2.278, reflecting a reduction in uncertainty. The research findings provide a scientific basis for the planning optimization, policy-making, and sustainable development of rural integrated energy systems, possessing both theoretical innovation and practical guiding value.
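For readers unfamiliar with the entropy weight method, the sketch below computes EWM weights from a toy decision matrix and blends them with subjective (IAHP-style) weights via a simple convex combination; the paper's exact sum-of-squared-deviations combination rule may differ, so treat the combination step as illustrative.

```python
# Hedged sketch: entropy weights plus a convex combination with expert weights.
import numpy as np

X = np.array([[0.8, 0.6, 0.9],   # alternatives x criteria (benefit-oriented, toy data)
              [0.7, 0.9, 0.5],
              [0.9, 0.7, 0.8]])
P = X / X.sum(axis=0)                               # column-normalized proportions
E = -(P * np.log(P)).sum(axis=0) / np.log(len(X))   # entropy per criterion
w_ewm = (1 - E) / (1 - E).sum()                     # objective (entropy) weights

w_iahp = np.array([0.5, 0.3, 0.2])                  # subjective weights (assumed)
alpha = 0.5                                         # combination coefficient (assumed)
w = alpha * w_iahp + (1 - alpha) * w_ewm
print(w_ewm.round(3), w.round(3))
```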
Authors: Oraib Dawaghreh, Sharaf K. Magableh, Xuesong Wang, Mohammad Adnan Magableh, Caisheng Wang
As traditional large hydropower has been extensively exploited, micro-hydro systems have attracted increasing research interest. New engineering challenges arise in developing micro-hydro systems in areas with significant elevation but prohibitive horizontal distances between primary reservoirs. This study addresses these challenges by proposing a cascade-pumped micro-hydro storage (CPMHS) system that leverages intermediate reservoirs to bridge long horizontal distances, enabling efficient energy transfer and storage. The methodology utilizes naturally occurring lakes with substantial head heights but limited feasibility for direct pumped storage due to horizontal separations. Integrating smaller, strategically placed intermediate reservoirs maximizes energy capture along the cascading path, making pumped storage viable in geographically constrained locations. The proposed system will enhance energy generation potential and provide additional benefits for water management. Using geographical data and a detailed case study focused on Mountain Lake and surrounding lakes, this paper demonstrates the energy efficiency and viability of cascade-based micro-hydro storage. A practical methodology for implementing CPMHS systems is proposed and validated by case studies. An optimization framework is developed for efficient energy capture in regions with challenging topography.
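As a back-of-the-envelope check of the energy stakes, the storable energy of a reservoir stage follows the standard hydropower relation (illustrative numbers, not the paper's case study):

\[ E = \eta\,\rho\,g\,V\,h, \]

so a stage with volume \(V = 10^{5}\,\mathrm{m^3}\), head \(h = 100\,\mathrm{m}\), and efficiency \(\eta = 0.8\) stores roughly \(0.8 \times 1000 \times 9.81 \times 10^{5} \times 100 \approx 7.8 \times 10^{10}\,\mathrm{J} \approx 21.8\,\mathrm{MWh}\).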
Authors: Ryan Piansky, Rahul K. Gupta, Daniel K. Molzahn
With electric power infrastructure posing an increasing risk of igniting wildfires under continuing climate change, utilities are frequently de-energizing power lines to mitigate wildfire ignition risk, which can cause load shedding. Recent research advocates for installing battery energy storage systems as well as undergrounding risky overhead lines to reduce the load shedding during such de-energizations. Since wildfire ignition risk can exhibit substantial geographic and temporal variations, it is important to plan battery installation and line undergrounding investments while considering multiple possible scenarios. This paper presents a scenario-based framework for optimizing battery installation and line undergrounding investments while considering many scenarios, each consisting of a day-long time series of uncertain parameters for the load demand, renewable generation, and wildfire ignition risks. This problem is difficult to solve due to the large number of scenarios and the binary variables associated with the battery placements as well as the lines to be undergrounded. To address the computational challenges, we decompose the problem in a two-stage scheme via a Benders decomposition approach. The first stage is a master problem formulated as a mixed integer linear programming (MILP) model that makes decisions on the locations and sizes of batteries as well as the lines to be undergrounded. The second stage consists of a linear programming model that assesses these battery and line undergrounding decisions as modeled by a DC OPF formulation. We demonstrate the effectiveness of the proposed scheme on a large-scale transmission network with real world data on wildfire ignition risks, load, and renewable generation.
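In generic Benders notation (assumed here, not necessarily the paper's exact formulation), the master problem at iteration \(k\) is

\[ \min_{x \in X,\ \theta_s} \; c^{\top}x + \sum_{s} p_s\,\theta_s \quad \text{s.t.} \quad \theta_s \ge Q_s\bigl(\hat{x}^{(j)}\bigr) + \lambda_s^{(j)\top}\bigl(x - \hat{x}^{(j)}\bigr), \quad j = 1,\dots,k, \]

where \(x\) collects the binary battery siting/sizing and undergrounding decisions, \(Q_s(x)\) is the optimal value of the scenario-\(s\) DC-OPF linear program, and \(\lambda_s^{(j)}\) are dual-based subgradients; cuts are added until the master bound and the subproblem values converge.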
Authors: Sha Ye, Qiong Wu, Pingyi Fan, Qiang Fan
Internet of Vehicles (IoV), as the core of intelligent transportation systems, enables comprehensive interconnection between vehicles and their surroundings through multiple communication modes, which is significant for autonomous driving and intelligent traffic management. However, with the emergence of new applications, traditional communication technologies face the problems of scarce spectrum resources and high latency. Semantic communication, which focuses on extracting, transmitting, and recovering useful semantic information from messages, can reduce redundant data transmission, improve spectrum utilization, and provide innovative solutions to communication challenges in the IoV. This paper systematically reviews the state of the art of semantic communications in the IoV, elaborates the technical background of IoV and semantic communications, and discusses in depth the key technologies of semantic communications in the IoV, including semantic information extraction, semantic communication architecture, and resource allocation and management. Through specific case studies, it demonstrates that semantic communications can be effectively employed in the scenarios of traffic environment perception and understanding, intelligent driving decision support, IoV service optimization, and intelligent traffic management. Additionally, it analyzes the current challenges and future research directions. This survey reveals that semantic communications have broad application prospects in the IoV, but existing practical problems must be solved by combining advanced technologies to promote their wide application and to contribute to the development of intelligent transportation systems.
Authors: Anshu Ankolekar, Sebastian Boie, Maryam Abdollahyan, Emanuela Gadaleta, Seyed Alireza Hasheminasab, Guang Yang, Charles Beauville, Nikolaos Dikaios, George Anthony Kastis, Michael Bussmann, Sara Khalid, Hagen Kruger, Philippe Lambin, Giorgos Papanastasiou
Federated Learning (FL) has emerged as a promising solution to address the limitations of centralised machine learning (ML) in oncology, particularly in overcoming privacy concerns and harnessing the power of diverse, multi-center data. This systematic review synthesises current knowledge on the state-of-the-art FL in oncology, focusing on breast, lung, and prostate cancer. Distinct from previous surveys, our comprehensive review critically evaluates the real-world implementation and impact of FL on cancer care, demonstrating its effectiveness in enhancing ML generalisability, performance and data privacy in clinical settings. We evaluated state-of-the-art advances in FL, demonstrating its growing adoption amid tightening data privacy regulations. FL outperformed centralised ML in 15 out of the 25 studies reviewed, spanning diverse ML models and clinical applications, and facilitating integration of multi-modal information for precision medicine. Despite the current challenges identified in reproducibility, standardisation and methodology across studies, the demonstrable benefits of FL in harnessing real-world data and addressing clinical needs highlight its significant potential for advancing cancer research. We propose that future research should focus on addressing these limitations and investigating further advanced FL methods, to fully harness data diversity and realise the transformative power of cutting-edge FL in cancer care.
Authors: Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, Hao-Wen Dong
Despite recent advancements in music generation systems, their application in film production remains limited, as they struggle to capture the nuances of real-world filmmaking, where filmmakers consider multiple factors, such as visual content, dialogue, and emotional tone, when selecting or composing music for a scene. This limitation primarily stems from the absence of comprehensive datasets that integrate these elements. To address this gap, we introduce Open Screen Sound Library (OSSL), a dataset consisting of movie clips from public domain films, totaling approximately 36.5 hours, paired with high-quality soundtracks and human-annotated mood information. To demonstrate the effectiveness of our dataset in improving the performance of pre-trained models on film music generation tasks, we introduce a new video adapter that enhances an autoregressive transformer-based text-to-music model by adding video-based conditioning. Our experimental results demonstrate that our proposed approach effectively enhances MusicGen-Medium in terms of both objective measures of distributional and paired fidelity, and subjective compatibility in mood and genre. The dataset and code are available at this https URL.
Authors: Hengyu Liu, Yanhong Luo, Congcong Wu, Yin Guan, Ahmed Lotfy Elrefai, Andreas Elombo, Si Li, Sahban Wael Saeed Alnaser, Mingyu Yan
The large-scale access of electric vehicles to the power grid not only provides flexible adjustment resources for the power system, but the temporal uncertainty and distribution complexity of their energy interaction pose significant challenges to the economy and robustness of the micro-energy network. In this paper, we propose a multi-time-scale rolling optimization scheduling method for micro-energy networks considering the access of electric vehicles. To evaluate the dispatchable potential of electric vehicle clusters, a charging station aggregation model is constructed based on Minkowski summation theory, and the scattered electric vehicle resources are aggregated into virtual energy storage units to participate in system scheduling. Price-based and incentive-based demand response mechanisms are integrated to synergistically tap the regulation potential on both the source and load sides. On this basis, a two-stage day-ahead and intra-day optimal scheduling model is constructed. The simulation results show that the proposed method reduces the scale of "preventive curtailment" due to more accurate scheduling, avoids the threat of power shortage to the safety of the power grid, and has more advantages in the efficiency of new energy consumption. At the same time, intra-day scheduling significantly reduces economic penalties and operating costs by avoiding output shortages, and improves the economy of the system in an uncertain forecasting environment.
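For interval (box) models of per-EV power and cumulative-energy bounds, the Minkowski sum that underlies the aggregation reduces to summing interval endpoints per time step, which is what makes the virtual-storage abstraction cheap to compute; the general polytope case requires more care. A toy sketch:

```python
# Hedged sketch: Minkowski-sum aggregation of per-EV flexibility intervals.
import numpy as np

# Per-EV bounds over T=2 time steps: shape (n_ev, T). Toy values.
p_min = np.array([[0.0, 0.0], [0.0, 0.0]])      # kW, min charging power
p_max = np.array([[7.0, 7.0], [11.0, 0.0]])     # kW (second EV departs at t=1)
e_min = np.array([[2.0, 6.0], [1.0, 5.0]])      # kWh, cumulative-energy lower bounds
e_max = np.array([[7.0, 14.0], [11.0, 11.0]])   # kWh, cumulative-energy upper bounds

# Aggregated "virtual storage" envelope seen by the scheduler: sum per time step.
agg = {name: arr.sum(axis=0) for name, arr in
       dict(p_min=p_min, p_max=p_max, e_min=e_min, e_max=e_max).items()}
print(agg)
```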
Authors: Olav Galteland, Jacob Hadler-Jacobsen, Hanne Kauko
This study investigates the economic viability and optimal configuration of a hybrid industrial energy system combining an electrode boiler, steam accumulator, and battery energy storage system (BESS). This study optimizes system operation for a specific configuration to minimize net energy costs, defined as energy costs minus profits from price arbitrage and reserve markets. The optimization uses load shifting, peak shaving, and frequency containment reserve (FCR) market participation with hourly 2024 data from Norway and Germany. Net present value (NPV) analysis was performed to determine the most cost-efficient energy storage configurations. The results show that current investment costs favor steam accumulators over BESS in both countries. However, a reduction in BESS cost will make batteries economically competitive, particularly in Germany, where high price volatility and power-based grid tariffs provide stronger incentives for load shifting and peak shaving. Participation in the FCR market accounted for a 17% and 7% reduction of the net energy costs in Norway and Germany, respectively. Utilization of excess heat, through inlet water preheating, further reduced the net energy costs. Sensitivity analyses confirm that investment costs, especially for BESS, strongly influence optimal system design. These findings offer guidance for industrial flexibility investments across diverse electricity markets.
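The NPV screening step reduces to the textbook discounted cash flow sum; the sketch below compares two hypothetical configurations with made-up capex and savings figures (the study's 2024 market data and tariff effects are not reproduced here).

```python
# Hedged sketch: NPV comparison of two storage configurations (toy numbers).
def npv(capex, annual_saving, rate, years):
    """Net present value of an investment with a constant annual saving."""
    return -capex + sum(annual_saving / (1 + rate) ** t for t in range(1, years + 1))

print(npv(capex=1.2e6, annual_saving=1.5e5, rate=0.07, years=15))   # accumulator-like
print(npv(capex=2.5e6, annual_saving=3.0e5, rate=0.07, years=15))   # BESS-like
```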
Authors: Jared Miller, Maitraya Avadhut Desai, Xiuqiang He, Roy S. Smith, Gabriela Hug
Grid-forming inverters control the power transfer between the AC and DC sides of an electrical grid while maintaining the frequency and voltage of the AC side. This paper focuses on ensuring large-signal stability of an electrical grid with inverter-interfaced renewable sources. We prove that the Hybrid-Angle Control (HAC) scheme for grid-forming inverters can exhibit incremental passivity properties between current and voltage at both the AC and DC ports. This incremental passivity can be certified through decentralized conditions. Inverters operating under HAC can, therefore, be connected to other passive elements (e.g. transmission lines) with an immediate guarantee of global transient stability regardless of the network topology or parameters. Passivity of Hybrid Angle Control is also preserved under small-signal (linearized) analyses, in contrast to conventional proportional droop laws that are passivity-short at low frequencies. Passivity and interconnected-stability properties are demonstrated through an example case study.
Authors: Young-ho Cho, Min-Seung Ko, Hao Zhu
A sustainable electricity infrastructure requires the explicit integration of carbon emissions into power system modeling and optimization paradigms. However, existing open-source datasets for power system R&D lack generator-level carbon emission profiling, limiting the ability to benchmark and compare various carbon-aware grid operational strategies. To address this gap, this work introduces PGLib-CO2, an open-source extension to the widely adopted PGLib-OPF test case library. PGLib-CO2 enriches standard network cases with CO2 and CO2-equivalent emission intensity factors by expanding the fuel-type categorization used by PGLib-OPF, attaining a realistic generator-level carbon profiling. It is also packaged for both Python's pandapower and Julia's this http URL, for a seamless, user-friendly integration of emission modeling into grid computation and optimization tasks. The dataset produced by PGLib-CO2 can support grid-based carbon accounting, emission metric evaluation, and integration into AC optimal power flow (OPF) and optimal load shifting (OLS) formulations. We demonstrate PGLib-CO2's utility through case studies that quantify cost-emission trade-offs and optimize a carbon-aware objective function. By standardizing carbon-enhanced test cases, PGLib-CO2 provides an open-source, reproducible foundation for benchmarking carbon-aware computation, facilitating future research in sustainable power system operation.
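A plausible usage pattern is sketched below under assumptions: attach per-generator emission intensities to a pandapower case, run AC-OPF, and account for dispatch emissions. The column name co2_t_per_mwh and the factor values are illustrative, PGLib-CO2's actual schema may differ, and the slack/ext_grid injection is ignored for simplicity.

```python
# Hedged sketch: generator-level emission accounting on a pandapower case.
import numpy as np
import pandapower as pp
import pandapower.networks as pn

net = pn.case9()                                      # WSCC 9-bus case with cost data
net.gen["co2_t_per_mwh"] = np.linspace(0.05, 0.95, len(net.gen))  # assumed factors

pp.runopp(net)                                        # AC optimal power flow
emissions = (net.res_gen.p_mw * net.gen["co2_t_per_mwh"]).sum()
print(f"total generator CO2 (ext_grid ignored): {emissions:.2f} t/h")
```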
Authors: Andreas Bouterakos, Georgios Tzounas
The paper focuses on the numerical stability and accuracy of implicit time-domain integration (TDI) methods when applied for the solution of a power system model impacted by time delays. Such a model is generally formulated as a set of delay differential algebraic equations (DDAEs) in non index-1 Hessenberg form. In particular, the paper shows that numerically stable ordinary differential equation (ODE) methods, such as the trapezoidal and the Theta method, can become unstable when applied to a power system that includes a significant number of delayed variables. Numerical stability is discussed through a scalar test delay differential equation, as well as through a matrix pencil approach that accounts for the DDAEs of any given dynamic power system model. Simulation results are presented in a case study based on the IEEE 39-bus system.
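The scalar test equation mentioned above is \(\dot{x}(t) = a\,x(t) + b\,x(t-\tau)\); the sketch below integrates it with the trapezoidal rule when the delay is a multiple of the step, which is the simplest setting in which the discussed stability effects can be probed numerically. Parameters are illustrative, not the paper's case study.

```python
# Hedged sketch: trapezoidal integration of x'(t) = a*x(t) + b*x(t - tau),
# with constant history x(t) = 1 for t <= 0 and tau a multiple of the step h.
import numpy as np

a, b, tau = -1.0, -0.9, 1.0
h = 0.05
d = int(round(tau / h))              # delay in steps
N = int(40.0 / h)                    # integrate up to T = 40

x = np.ones(N + 1)
for n in range(N):
    xd_now = x[n - d] if n - d >= 0 else 1.0         # x(t_n - tau)
    xd_next = x[n + 1 - d] if n + 1 - d >= 0 else 1.0
    # Trapezoidal rule: the delayed terms are already known, so the
    # implicit update reduces to a scalar linear solve.
    rhs = x[n] + 0.5 * h * (a * x[n] + b * xd_now + b * xd_next)
    x[n + 1] = rhs / (1.0 - 0.5 * h * a)
print("x(T) =", x[-1])               # decays for this delay-independent stable case
```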
Authors: Peng Liu, Meng Hua, Guangji Chen, Xinyi Wang, Zesong Fei
In this paper, we investigate a novel wireless powered mobile edge computing (MEC) system assisted by pinching antennas (PAs), where devices first harvest energy from a base station and then offload computation-intensive tasks to an MEC server. As an emerging technology, PAs utilize long dielectric waveguides embedded with multiple localized dielectric particles, which can be spatially configured through a pinching mechanism to effectively reduce large-scale propagation loss. This capability facilitates both efficient downlink energy transfer and uplink task offloading. To fully exploit these advantages, we adopt a non-orthogonal multiple access (NOMA) framework and formulate a joint optimization problem to maximize the system's computational capacity by jointly optimizing device transmit power, time allocation, PA positions in both uplink and downlink, and radiation control. To address the resulting non-convexity caused by variable coupling, we develop an alternating optimization algorithm that integrates particle swarm optimization (PSO) with successive convex approximation. Simulation results demonstrate that the proposed PA-assisted design substantially improves both energy harvesting efficiency and computational performance compared to conventional antenna systems.
Authors: Ahmed Y. Radwan, Mustafa Yildirim, Navid Hasanzadeh, Hina Tabassum, Shahrokh Valaee
Wi-Fi technology has evolved from simple communication routers to sensing devices. Wi-Fi sensing leverages conventional Wi-Fi transmissions to extract and analyze channel state information (CSI) for applications like proximity detection, occupancy detection, activity recognition, and health monitoring. By leveraging existing infrastructure, Wi-Fi sensing offers a privacy-preserving, non-intrusive, and cost-effective solution which, unlike cameras, is not sensitive to lighting conditions. Beginning with a comprehensive review of the Wi-Fi standardization activities, this tutorial-cum-survey first introduces fundamental concepts related to Wi-Fi CSI, outlines the CSI measurement methods, and examines the impact of mobile objects on CSI. The mechanics of a simplified testbed for CSI extraction are also described. Then, we present a qualitative comparison of the existing Wi-Fi sensing datasets, their specifications, and pinpoint their shortcomings. Next, a variety of preprocessing techniques are discussed that are beneficial for feature extraction and explainability of machine learning (ML) algorithms. We then provide a qualitative review of recent ML approaches in the domain of Wi-Fi sensing and present the significance of self-supervised learning (SSL) in that context. Specifically, the mechanics of contrastive and non-contrastive learning solutions are elaborated in detail and a quantitative comparative analysis is presented in terms of classification accuracy. Finally, the article concludes by highlighting emerging technologies that can be leveraged to enhance the performance of Wi-Fi sensing and opportunities for further research in this domain.
Authors: Dan Sturm, Marzieyh Rezaei, Alana Dee, Sajjad Moazeni
Co-packaged optics (CPO) has emerged as a promising solution for achieving the ultra-high bandwidths, shoreline densities, and energy efficiencies required by future GPUs and network switches for AI. Microring modulators (MRMs) are well suited for transmitters due to their compact size, high energy efficiency, and natural compatibility with dense wavelength-division multiplexing (DWDM). However, extending beyond the recently demonstrated 200 Gb/s will require more advanced modulation formats, such as higher-order coherent modulation (e.g., QAM-16). In this work, we show how microring modulators (MRMs) can be efficiently used to implement phase-constant amplitude modulators and form the building blocks of a transmitter for offset QAM-16, which has been shown to simplify carrier-phase recovery relative to conventional QAM. We simulate and evaluate the performance of our proposed MRM-based coherent CPO (C2PO) transmitters using a foundry-provided commercial silicon photonics process, demonstrating an input-normalized electric field amplitude contrast of 0.64 per dimension. Through full link-level bit error rate modeling, we show that our design achieves 400 Gb/s using offset QAM-16 at a total optical laser power of 9.65 dBm-comparable to that required by conventional QAM-16 MZI-based links, despite using 10-100x less area. We further conduct a thermal simulation to assess the transmitter's thermal stability at the MRM input optical power required to meet a target BER at the desired data rates. Finally, as a proof of concept, we demonstrate 25 Gb/s MRM-based offset QAM-4 modulation with a chip fabricated in the GlobalFoundries 45 nm monolithic silicon photonics process.
Authors: Huanqiang Duan, Manno Versluis, Qinyu Chen, Leo C. N. de Vreede, Chang Gao
Digital predistortion (DPD) is essential for mitigating nonlinearity in RF power amplifiers, particularly for wideband applications. This paper presents TCN-DPD, a parameter-efficient architecture based on temporal convolutional networks, integrating noncausal dilated convolutions with optimized activation functions. Evaluated on the OpenDPD framework with the DPA_200MHz dataset, TCN-DPD achieves simulated ACPRs of -51.58/-49.26 dBc (L/R), EVM of -47.52 dB, and NMSE of -44.61 dB with 500 parameters and maintains superior linearization compared to prior models down to 200 parameters, making it promising for efficient wideband PA linearization.
Authors: Zhipeng Fan, Yujie Xu, Mingyu Fu, Han Sun, Weiqiu Zhang, Heng Zhang
This brief proposes a distributed formation control strategy via matrix-weighted Laplacian that can achieve a similar formation in the 2-D plane using inter-agent relative displacement measurements. Formation patterns that include translation, rotation, and scaling can be characterized by the null space of the matrix-weighted Laplacian associated with the topological graph. The main contribution of this brief is to extend the similar formation problem of undirected graphs to directed acyclic graphs and provide the necessary algebraic criteria for leader selection. Stability analysis, illustrative examples, and simulation results are provided.
Authors: Francisco M. Arrabal-Campos, Francisco G. Montoya, Jorge Ventura, Santiago Sánchez-Acevedo, Raymundo E. Torres-Olguin, Francisco de León
This paper presents experimental validation of a time-domain load parameter determination method for single-phase circuits. The verification is performed in a state-of-the-art smart grid laboratory equipped with power hardware and real-time emulators. The proposed method enables the identification of circuit parameters using only instantaneous voltage and current measurements at the point of common coupling. The experimental setup includes a range of test cases covering linear and non-sinusoidal single-phase conditions. Voltage and current waveforms are acquired, preprocessed, and used to calculate the relevant circuit parameters. The experimental results demonstrate a high degree of accuracy and robustness, with minimal percentage errors across all test cases. The identified parameters show excellent agreement with the theoretical expectations, confirming the validity and applicability of the proposed method to identify the load of single-phase systems. This validation highlights the potential of the method for improved monitoring, control, and protection of smart grids, paving the way for future extensions to three-phase systems and real-time implementations.
Authors: Ryan Quach, Yidi Wang, Ali Jahanshahi, Daniel Wong, Hyoseung Kim
As AI inference becomes mainstream, research has begun to focus on improving the energy consumption of inference servers. Inference kernels commonly underutilize a GPU's compute resources and waste power from idling components. To improve utilization and energy efficiency, multiple models can co-locate and share the GPU. However, typical GPU spatial partitioning techniques often experience significant overheads when reconfiguring spatial partitions, which can waste additional energy through repartitioning overheads or non-optimal partition configurations. In this paper, we present ECLIP, a framework to enable low-overhead energy-efficient kernel-wise resource partitioning between co-located inference kernels. ECLIP minimizes repartitioning overheads by pre-allocating pools of CU masked streams and assigns optimal CU assignments to groups of kernels through our resource allocation optimizer. Overall, ECLIP achieves an average of 13% improvement to throughput and 25% improvement to energy efficiency.
Authors: Florian Klein-Helmkamp, Tina Möllemann, Irina Zettl, Andreas Ulbig
The integration of distributed energy resources (DERs) into sub-transmission systems has enabled new opportunities for flexibility provision in ancillary services such as frequency and voltage support, as well as congestion management. This paper investigates the stability and performance of Online Feedback Optimization (OFO) controllers in ensuring reliable flexibility provision. A hierarchical control architecture is proposed, emphasizing safe transitions between system states within the Feasible Operating Region (FOR). We evaluate the controller's stability and performance through simulations of transitions to the vertices of the FOR, analyzing the impact of tuning parameters. The study demonstrates that controller stability is sensitive to parameter tuning, particularly gain and sensitivity approximations. Results show that improper tuning can lead to oscillatory or unstable behavior, highlighting the need for systematic parameter selection to ensure reliable operation across the full flexibility range.
Authors: Zeenat Hameed, Chresten Træholt
Battery Energy Storage Systems (BESS) are critical for modern power networks, supporting grid services such as frequency regulation, peak shaving, and black start. Delivering a BESS under an Engineering, Procurement, and Construction (EPC) model requires a concise methodology that balances regulatory compliance, technical details, and schedule efficiency. This paper presents a streamlined, five-step EPC framework covering feasibility assessment, permitting, procurement, construction, and commissioning. A Danish demonstration (the BOSS project on Bornholm) serves as a case study.
Authors: Xiang Zhu, Hua Geng, Hongyang Qing, Xin Zou
This paper proposes a multi-objective optimization (MOO) approach for grid-level frequency regulation by aggregating inverter-based resources (IBRs). Virtual power plants (VPPs), acting as aggregators, can efficiently respond to dynamic response requirements from the grid. Through parametric modeling, grid-level frequency regulation requirements are accurately quantified and translated into a feasible parameter region defined by device-level parameters. Based on this feasible region, an MOO model is developed to address the conflicting demands of IBRs during frequency response. A Nash bargaining game-based approach is then employed to optimally allocate regulation requirements within the VPP, balancing the various demands of the IBRs. Numerical experiments demonstrate the effectiveness of the proposed method in enhancing frequency stability and improving coordination among IBRs.
Authors: Sushobhan Chatterjee, Sijia Geng
This paper investigates voltage stability in inverter-based power systems concerning fold and saddle-node bifurcations. An analytical expression is derived for the sensitivity of the stability margin using the normal vector to the bifurcation hypersurface. Such information enables efficient identification of effective control parameters in mitigating voltage instability. Comprehensive analysis reveals that reactive loading setpoint and current controller's feedforward gain are the most influential parameters for enhancing voltage stability in a grid-following (GFL) inverter system, while the voltage controller's feedforward gain plays a dominant role in a grid-forming (GFM) inverter. Notably, both theoretical and numerical results demonstrate that transmission line dynamics have no impact on fold/saddle-node bifurcations in these systems. Results in this paper provide insights for efficient analysis and control in future inverter-dominated power systems through reductions in parameter space and model complexity.
Authors: Nicals Tietze, Kai Wulff, Johann Reger
We consider trajectory tracking for minimum-phase nonlinear systems in Byrnes-Isidori form using the model-following control (MFC) architecture. The tracking problem is motivated by a hierarchical control concept where a higher-level instance provides the reference trajectory at run-time. We present a computationally efficient implementation of the feedback linearisation MFC design, and apply high-gain feedback in the process control loop (PCL) to achieve practical tracking in the presence of Lipschitz perturbations. Our main results establish ultimate boundedness of the tracking error and give a constructive bound for the high-gain scaling parameter to achieve arbitrary tracking precision. Further, we establish that the peaking phenomenon can be attenuated using MFC. We demonstrate the results via an automotive case study considering advanced engine-based cruise control.
Authors: Junjin Lv, Chenggang Cui, Shaodi Zhang, Hui Chen, Chunyang Gong, Jiaming Liu
The Unit Commitment (UC) problem is a classic challenge in the optimal scheduling of power systems. Years of research and practice have shown that formulating reasonable unit commitment plans can significantly improve the economic efficiency of power systems' operations. In recent years, with the introduction of technologies such as machine learning and the Lagrangian relaxation method, the solution methods for the UC problem have become increasingly diversified, but still face challenges in terms of accuracy and robustness. This paper proposes a Function Space Search (FunSearch) method based on large language models. This method combines pre-trained large language models and evaluators to creatively generate solutions through the program search and evolution process while ensuring their rationality. In simulation experiments, a unit commitment case with \(10\) units is mainly used. Compared to the genetic algorithm, the results show that FunSearch performs better in terms of sampling time, evaluation time, and total operating cost of the system, demonstrating its great potential as an effective tool for solving the UC problem.
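FunSearch needs a programmatic evaluator to score candidate heuristics; the sketch below shows the kind of schedule-cost evaluator such a loop could call, with toy unit data, a crude merit-order dispatch, and a penalty for uncoverable demand (the paper's 10-unit case and exact cost model are not reproduced).

```python
# Hedged sketch: scoring an on/off commitment schedule, FunSearch-evaluator style.
units = [  # (p_min, p_max, marginal_cost $/MWh, no_load_cost $/h) -- toy data
    (100, 400, 20.0, 500.0),
    (50, 200, 30.0, 200.0),
    (20, 100, 45.0, 80.0),
]
load = [350, 520, 610, 480]          # MW demand per hour

def evaluate(schedule, penalty=1e4):
    """Total cost of a schedule: no-load plus greedy merit-order dispatch,
    with a large penalty when committed units cannot match demand."""
    total = 0.0
    for t, demand in enumerate(load):
        on = [u for u, s in zip(units, schedule[t]) if s]
        cap = sum(u[1] for u in on)
        floor = sum(u[0] for u in on)
        target = min(max(demand, floor), cap)       # closest feasible output
        total += penalty * abs(demand - target)     # infeasibility penalty
        remaining = target
        for p_min, p_max, mc, nl in sorted(on, key=lambda u: u[2]):
            p = min(p_max, max(p_min, remaining))   # crude merit-order dispatch
            remaining = max(0.0, remaining - p)
            total += nl + mc * p
    return total

# A candidate heuristic's schedule: the two cheapest units always on, peaker off.
print(evaluate([[1, 1, 0]] * 4))
```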
Authors: Cong Chen, Omer Karaduman, Xu Kuang
Accurately modeling consumer behavior in energy operations remains challenging due to inherent uncertainties, behavioral complexities, and limited empirical data. This paper introduces a novel approach leveraging generative agents--artificial agents powered by large language models--to realistically simulate customer decision-making in dynamic energy operations. We demonstrate that these agents behave more optimally and rationally in simpler market scenarios, while their performance becomes more variable and suboptimal as task complexity rises. Furthermore, the agents exhibit heterogeneous customer preferences, consistently maintaining distinct, persona-driven reasoning patterns. Our findings highlight the potential value of integrating generative agents into energy management simulations to improve the design and effectiveness of energy policies and incentive programs.
Authors: Pantelis Dogoulis, Karim Tit, Maxime Cordy
In the modern context of power systems, rapid, scalable, and physically plausible power flow predictions are essential for ensuring the grid's safe and efficient operation. While traditional numerical methods have proven robust, they require extensive computation to maintain physical fidelity under dynamic or contingency conditions. In contrast, recent advancements in artificial intelligence (AI) have significantly improved computational speed; however, they often fail to enforce fundamental physical laws during real-world contingencies, resulting in physically implausible predictions. In this work, we introduce KCLNet, a physics-informed graph neural network that incorporates Kirchhoff's Current Law as a hard constraint via hyperplane projections. KCLNet attains competitive prediction accuracy while ensuring zero KCL violations, thereby delivering reliable and physically consistent power flow predictions critical to secure the operation of modern smart grids.
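The hard-constraint mechanism amounts to an orthogonal projection onto the affine set \(\{x : Ax = b\}\) encoding KCL; below is a minimal numpy sketch with toy incidence data, not KCLNet's actual architecture.

```python
# Hedged sketch: projecting a raw prediction onto the KCL hyperplane A x = b.
import numpy as np

A = np.array([[1.0, -1.0, -1.0,  0.0],    # KCL at node 1: i1 = i2 + i3
              [0.0,  1.0,  0.0, -1.0]])   # KCL at node 2: i2 = i4
b = np.zeros(2)

x_pred = np.array([1.00, 0.62, 0.35, 0.70])   # raw model output (violates KCL)

# Orthogonal projection: x* = x - A^T (A A^T)^{-1} (A x - b)
correction = A.T @ np.linalg.solve(A @ A.T, A @ x_pred - b)
x_proj = x_pred - correction
print(x_proj, A @ x_proj)    # constraint residual is ~0 after projection
```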
Authors: Lorenzo Lyons, Manuel Boldrer, Laura Ferranti
This paper presents a novel distributed vehicle platooning control and coordination strategy. We propose a distributed predecessor-follower CACC scheme that allows to choose an arbitrarily small inter-vehicle distance while guaranteeing no rear-end collisions occur, even in the presence of undetected cyber-attacks on the communication channels such as false data injection. The safety guarantees of the CACC policy are derived by combining a sensor-based ACC policy that explicitly accounts for actuator saturation, and a communication-based predictive term that has state-dependent limits on its control authority, thus containing the effects of an unreliable communication channel. An undetected attack may still however be able to degrade platooning performance. To mitigate it, we propose a tailored Kalman observer-based attack detection algorithm that initially triggers a switch from the CACC policy to the ACC policy. Subsequently, by relying on a high-level coordinator, our strategy allows to isolate a compromised vehicle from the platoon formation by reconfiguring the platoon topology itself. The coordinator can also handle merging and splitting requests. We compare our algorithm in an extensive simulation study against a state of the art distributed MPC scheme and a robust control scheme. We additionally extensively test our full method in practice on a real system, a team of scaled-down car-like robots. Furthermore, we share the code to run both the simulations and robotic experiments.
Authors: Cheng-Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan Po Huang, Hung-Yi Lee
We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. Then, synthesized speech-text pairs are bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrate the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from the AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. Results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and provide a practical pathway for improving ASR performance in low-resource or domain-specific settings.
Authors: Peng Yang, Xiaoyu Peng, Xi Ru, Hua Geng, Feng Liu
Traditional centralized stability analysis struggles with scalability in large complex modern power grids. This two-part paper proposes a compositional and equilibrium-free approach to analyzing power system stability. In Part I, we prove that equilibrium-free local conditions can certify system-wide stability of power systems with heterogeneous nonlinear devices and structure-preserving lossy networks. This is built on a recently developed notion of delta dissipativity, which yields local stability conditions without knowing the system-wide equilibrium. As a consequence, our proposed theory can certify the stability of equilibrium sets rather than a single equilibrium. In Part I, we verify our theory and demonstrate promising implications using the single-machine single-load benchmark, which helps to better explain the compositional and equilibrium-set-oriented stability analysis. Part II of this paper will provide methods for applying our theory to complex power grids, together with case studies across a wide range of system scales. Our results enable a more scalable and adaptable approach to stability analysis. They also shed light on how to regulate grid-connected devices to guarantee system-wide stability.
Authors: Peng Yang, Yifan Su, Xiaoyu Peng, Hua Geng, Feng Liu
This two-part paper proposes a compositional and equilibrium-free approach to analyzing power system stability. In Part I, we have established the stability theory and proposed stability conditions based on the delta dissipativity. In Part II, we focus on methods for applying our theory to complex power grids. We first propose a method to verify the local condition, i.e., delta dissipativity, for heterogeneous devices in power systems. Then, we propose a method to verify the coupling condition based on the Alternating Direction Method of Multipliers (ADMM). Finally, we investigate three applications of our theory including stability assessment toward multiple equilibria, stability assessment under varying operating conditions, and a distributed computing framework. Case studies on modified IEEE 9-bus, 39-bus, and 118-bus benchmarks verify our theory and methods.
Authors: Mohammad Soleymani, Ignacio Santamaria, Eduard Jorswieck, Robert Schober, Lajos Hanzo
Energy-efficient designs are proposed for multi-user (MU) multiple-input multiple-output (MIMO) broadcast channels (BC), assisted by simultaneously transmitting and reflecting (STAR) reconfigurable intelligent surfaces (RIS) operating at finite block length (FBL). In particular, we maximize the sum energy efficiency (EE), showing that STAR-RIS can substantially enhance it. Our findings demonstrate that the gains of employing STAR-RIS increase when the codeword length and the maximum tolerable bit error rate decrease, meaning that a STAR-RIS is more energy efficient in a system with more stringent latency and reliability requirements.
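The finite-block-length regime referenced above is typically handled through the normal approximation of Polyanskiy et al., which for block length \(n\) and block error probability \(\epsilon\) gives

\[ R(n,\epsilon) \approx C - \sqrt{\frac{V}{n}}\,Q^{-1}(\epsilon), \]

with \(C\) the Shannon capacity, \(V\) the channel dispersion, and \(Q^{-1}\) the inverse Gaussian tail function. The gap term \(\sqrt{V/n}\,Q^{-1}(\epsilon)\) grows as the codeword shortens or \(\epsilon\) tightens, which is consistent with the reported finding that STAR-RIS gains increase under stricter latency and reliability requirements.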
Authors: Xiang Zhu, Hua Geng, Hongyang Qing, Guangchun (Grant) Ruan, Xiuqiang He
The rapid integration of inverter-based resources (IBRs) into power systems has introduced frequency security challenges due to reduced inertia and increased load volatility. This paper proposes a robust power reserve decision-making approach for dynamic virtual power plants (DVPPs) to address these challenges, especially under temporally sequential and uncertain disturbances. An analytical model is developed to characterize the system's frequency response dynamics, enabling the quantification of virtual inertia and virtual damping requirements to meet rate-of-change-of-frequency (RoCoF), frequency nadir, and steady-state deviation constraints. By analytically deriving the regulation power dynamics, the required virtual inertia and damping parameters for the DVPP are determined in a robust way. Then, the total power reserve decision is made by optimally allocating the parameters and calculating the actual power reserves for IBRs, fully considering their economic diversity. Finally, case studies conducted on an IEEE nine-bus system demonstrate the effectiveness of the proposed approach. The results indicate the high reliability of the proposed approach in ensuring frequency security.
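The frequency-response quantification rests on the standard aggregated swing dynamics (generic per-unit notation, assumed here):

\[ 2H\,\frac{d\,\Delta f}{dt} = \Delta P - D\,\Delta f, \]

which ties the three constraints directly to the two DVPP parameters: the initial RoCoF is \(|\Delta P|/(2H)\), the steady-state deviation is \(\Delta P / D\), and the nadir depends on both \(H\) and \(D\) through the transient. Robust choices of virtual inertia \(H\) and damping \(D\) therefore translate into the power reserve requirements described above.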
Authors: Arthur Brigatto, Alexandre Street, Cristiano Fernandes, Davi Valladao, Guilherme Bodin, Joaquim Dias Garcia
Hydroelectricity accounted for roughly 61.4% of Brazil's total generation in 2024 and compensated for most of the intermittency of wind and solar generation. Inflow forecasting therefore plays a critical role in the operation, planning, and markets of this country, as in any other hydro-dependent power system. These forecasts influence generation schedules, reservoir management, and market pricing, shaping the dynamics of the entire electricity sector. The objective of this paper is to measure and present empirical evidence of a systematic optimistic bias in the official inflow forecast methodology, which is based on the PAR(p)-A model. Additionally, we discuss possible sources of this bias and recommend ways to mitigate it. Analyzing 14 years of historical data from the Brazilian system through rolling-window multistep (out-of-sample) forecasts, we find that the official forecast model exhibits statistically significant biases of 6%, 14%, 20%, and 24% for 1-, 6-, 12-, and 24-step-ahead forecasts in the Southeast subsystem, and 19%, 57%, 81%, and 109% in the Northeast subsystem. These findings expose the limitations of current inflow forecasting methodologies used in Brazil and call for new governance and monitoring policies.
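The bias-measurement machinery itself is straightforward to prototype. The hypothetical sketch below runs rolling-window, multistep out-of-sample forecasts on synthetic monthly "inflows" with a deliberately optimistic toy forecaster, and recovers its built-in ~10% bias; it is illustrative only and is not the PAR(p)-A model.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_bias(y, forecast_fn, horizon, window=60):
    """Rolling-window, h-step-ahead relative bias:
    mean(forecast - actual) / mean(actual), out of sample."""
    errs, actuals = [], []
    for t in range(window, len(y) - horizon):
        f = forecast_fn(y[t - window:t], horizon)
        errs.append(f - y[t + horizon - 1])
        actuals.append(y[t + horizon - 1])
    return np.mean(errs) / np.mean(actuals)

def optimistic_mean(history, horizon):
    return 1.10 * history.mean()   # toy forecaster with 10% optimism

# synthetic monthly 'inflow' series: seasonality plus noise
t = np.arange(14 * 12)             # 14 years of monthly data
y = 100 + 30 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 10, len(t))
for h in (1, 6, 12, 24):
    print(h, round(100 * relative_bias(y, optimistic_mean, h), 1), "%")
```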
Authors: Shanthan Kumar Padisala, Satadru Dey
In autonomous electric vehicles (AEVs), battery energy must be judiciously allocated to satisfy primary propulsion demands and secondary auxiliary demands, particularly the Heating, Ventilation, and Air Conditioning (HVAC) system. This becomes especially critical when the battery is at a low state of charge under cold ambient conditions, where cabin heating and battery preconditioning (prior to charging) can consume a significant share of the available energy, directly impacting the driving range. In such cases, one usually prioritizes propulsion or applies heuristic rules for thermal management, often resulting in suboptimal energy utilization. There is a pressing need for a principled approach that can dynamically allocate battery power in a way that balances thermal comfort, battery health and preconditioning, and range preservation. This paper addresses this issue using real-time Model Predictive Control to optimally split power among propulsion, HVAC, and battery temperature preparation so that the battery can be charged immediately once the destination is reached.
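A minimal sketch of this kind of allocation MPC, written with cvxpy and entirely made-up thermal constants and limits (not the authors' model), might look like:

```python
import cvxpy as cp
import numpy as np

N, dt = 30, 60.0                  # 30 steps of 1 minute
p_prop = 8.0 * np.ones(N)         # known propulsion demand [kW]
E0, Emin = 6.0, 1.0               # available energy, reserve [kWh]

p_hvac = cp.Variable(N, nonneg=True)   # cabin heating power [kW]
p_cond = cp.Variable(N, nonneg=True)   # battery preconditioning [kW]
Tc = cp.Variable(N + 1)                # cabin temperature [C]
Tb = cp.Variable(N + 1)                # battery temperature [C]

a, b = 0.02, 0.01                      # crude thermal gains (invented)
cons = [Tc[0] == 5.0, Tb[0] == 0.0]
for k in range(N):
    # first-order thermal models with leakage toward -5 C ambient
    cons += [Tc[k+1] == Tc[k] + dt*(a*p_hvac[k] - 0.001*(Tc[k] + 5)),
             Tb[k+1] == Tb[k] + dt*(b*p_cond[k] - 0.001*(Tb[k] + 5))]
# total energy drawn must leave a range-preserving reserve
cons += [dt/3600 * cp.sum(p_prop + p_hvac + p_cond) <= E0 - Emin]
obj = cp.Minimize(cp.sum_squares(Tc[1:] - 20.0)       # comfort
                  + 5*cp.square(Tb[N] - 15.0)         # precondition target
                  + 0.01*cp.sum(p_hvac + p_cond))     # energy thrift
prob = cp.Problem(obj, cons)
prob.solve()
print(round(prob.value, 2),
      round(float(Tc.value[-1]), 1), round(float(Tb.value[-1]), 1))
```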
Authors: Rongfei Li, Francis Assadian
Visual servoing technology is well developed and applied in many automated manufacturing tasks, especially in tool pose alignment. To access a full global view of tools, most applications adopt an eye-to-hand configuration or an eye-to-hand/eye-in-hand cooperation configuration in the automated manufacturing environment. Most prior research focuses on developing control and observation architectures for various scenarios, but few studies discuss the importance of the camera's location in the eye-to-hand configuration. In a manufacturing environment, the quality of camera estimations may vary significantly from one observation location to another, as the combined effects of environmental conditions produce different noise levels in images shot at different locations. In this paper, we propose an algorithm for the camera's moving policy: the camera explores its workspace and searches for the optimal location, where the image noise level is minimized. The algorithm also ensures that, with limited energy available for moving the camera, the camera ends up at a suboptimal location (if the optimal one is unreachable) among the locations already searched. Unlike a simple brute-force approach, the algorithm explores the space more efficiently by adapting its search policy as it learns the environment. With the aid of an image-averaging technique, the algorithm, using a single camera, achieves desirable observation accuracy in eye-to-hand configurations without filtering out high-frequency information in the original image. An automated manufacturing application is simulated, and the results show that the algorithm improves observation precision with limited energy.
Authors: Otacilio B. L. Neto, Michela Mulas, Iiro Harjunkoski, Francesco Corona
This work proposes an automatic control solution for the operation of conventional wastewater treatment plants (WWTPs) as energy-autonomous water resource recovery facilities. We first conceptualize a classification of the quality of treated water for three resource recovery applications (environmental, industrial, and agricultural water reuse). We then present an output-feedback model predictive controller (Output MPC) that operates a plant to produce water of specific quality class, while also producing sufficient biogas to ensure nonpositive energy costs. The controller is demonstrated in the long-term operation of a full-scale WWTP subjected to typical influent loads and periodically changing quality targets. Our results provide a proof-of-concept on the energy-autonomous operation of existing wastewater treatment infrastructure with control strategies that are general enough to accommodate a wide range of resource recovery objectives.
Authors: Sebastián Rojas-Innocenti, Enrique Baeyens, Alejandro Martín-Crespo, Sergio Saludes-Rodil, Fernando Frechoso Escudero
This paper presents a scenario-based robust optimization framework for short-term energy scheduling in electricity-intensive industrial plants, explicitly addressing uncertainty in planning decisions. The model is formulated as a two-stage Mixed Integer Linear Program (MILP) and integrates a hybrid scenario generation method capable of representing uncertain inputs such as electricity prices, renewable generation, and internal demand. A convex objective function combining expected and worst-case operational costs allows for tunable risk aversion, enabling planners to balance economic performance and robustness. The resulting schedule ensures feasibility across all scenarios and supports coordinated use of industrial flexibility assets, including battery energy storage and shiftable production. To isolate the effects of market volatility, the framework is applied to a real-world cement manufacturing case study considering only day-ahead electricity price uncertainty, with all other inputs treated deterministically. Results show improved resilience to forecast deviations, reduced cost variability, and more consistent operations. The proposed method offers a scalable and risk-aware approach to industrial flexibility planning under uncertainty.
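Dropping the integer (shiftable-production) decisions, the core two-stage, risk-weighted structure can be sketched in cvxpy as follows. All data here are synthetic, and a battery stands in for the plant's flexibility assets.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
T, S, lam = 24, 20, 0.3                  # hours, scenarios, risk weight
load = 5 + 2*np.sin(np.linspace(0, 2*np.pi, T))      # plant demand [MW]
prices = 50 + 20*rng.standard_normal((S, T))         # price scenarios

ch = cp.Variable(T, nonneg=True)    # first-stage battery charge [MW]
dis = cp.Variable(T, nonneg=True)   # first-stage battery discharge [MW]
soc = cp.cumsum(ch - dis) * 1.0     # stored energy [MWh], starts empty
grid = load + ch - dis              # grid purchase per hour
costs = prices @ grid               # operating cost in each scenario
t = cp.Variable()                   # epigraph of the worst-case cost

cons = [soc >= 0, soc <= 4, ch <= 2, dis <= 2, grid >= 0,
        t >= costs]                 # t dominates every scenario cost
# convex combination of worst-case and expected cost (tunable risk)
obj = cp.Minimize(lam * t + (1 - lam) * cp.sum(costs) / S)
cp.Problem(obj, cons).solve()
print(round(float(t.value), 1), round(float(np.mean(costs.value)), 1))
```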
Authors: Francisco G. Montoya, Santiago Sánchez Acevedo
Various domains such as power system stability analysis, electric machine modeling, and control of power electronic converters have significantly benefited from the application of coordinate transformations. One of the main benefits is the dimensional reduction, which reduces the complexity of the problems. This paper introduces a novel general transformation based on a geometric framework that directly identifies the plane containing the locus for unbalanced quantities through bivector analysis using Geometric Algebra. The proposed method provides a direct transformation valid for any degree of unbalance in $n$-phase, $(n+1)$-wire sinusoidal systems. The transformation requires only two measurements (voltage or current) taken at different time instants, making it computationally efficient. Moreover, we demonstrate through pure geometric reasoning that our approach is general and encompasses other techniques, such as the classical Clarke transformation. Numerical simulations and experimental validation using a real-time digital simulator and a physical laboratory setup demonstrate the effectiveness of the proposed method. This generalization to multi-dimensional systems, combined with the reduced measurement requirements, represents a significant advancement over existing approaches that are typically restricted to three-phase applications or suffer from computational limitations.
Authors: Peng Zhang, Baosen Zhang
This paper addresses the optimal voltage control problem of distribution systems with high penetration of inverter-based renewable energy resources under inaccurate model information. We propose an online exponential barrier method that explicitly leverages online feedback from the grid to enhance robustness to model inaccuracy and incorporates voltage constraints to maintain safety requirements. We provide analytical results on optimal barrier parameter selection and sufficient conditions for the safety guarantee of converged voltages. We also establish theoretical results on the exponential convergence rate under a proper step size. The effectiveness of the proposed framework is validated on a 56-bus radial network, where it significantly improves robustness against model inaccuracy compared to existing methods.
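A rough sketch of an exponential-barrier feedback update on a linearized voltage model is given below; the sensitivity matrix, barrier gain, and step size are invented for illustration and are not the paper's tuned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
R = 0.05 * (np.eye(n) + 0.3 * rng.random((n, n)))  # rough sensitivity
R = (R + R.T) / 2 + 0.1 * np.eye(n)                # keep it well-posed
v0 = 1.0 + 0.03 * rng.standard_normal(n)           # uncontrolled voltages
vmin, vmax, alpha, eta = 0.95, 1.05, 50.0, 0.01

def true_voltage(q):
    # "grid feedback": simulated here; in the field it is measured,
    # which is what makes the scheme robust to errors in R
    return v0 + R @ q

q = np.zeros(n)
for _ in range(500):
    v = true_voltage(q)                            # online measurement
    # gradient of 0.5*||q||^2 plus exponential barriers on the limits
    barrier_grad = R.T @ (alpha*np.exp(alpha*(v - vmax))
                          - alpha*np.exp(alpha*(vmin - v)))
    q -= eta * (q + barrier_grad)
print(np.round(true_voltage(q), 4))  # steered toward [vmin, vmax]
```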
Authors: Latif U. Khan, Maher Guizani, Sami Muhaidat, Choong Seon Hong
The rapid advancement of wireless networks has given rise to numerous challenges, stemming from extensive quality-of-service demands and emerging quality-of-experience metrics (e.g., user-defined metrics such as the sense of physical experience in haptics applications). Meanwhile, large language models (LLMs) have emerged as promising solutions for many difficult and complex applications and tasks. Together, these developments motivate the integration of LLMs and wireless networks. However, this integration is challenging and requires careful design. Therefore, in this article, we present the notion of rational wireless networks powered by \emph{telecom LLMs}, namely, \emph{LLM-native wireless systems}. We provide fundamentals, a vision, and a case study of the distributed implementation of LLM-native wireless systems. In the case study, we propose a solution based on double deep Q-learning (DDQN) that outperforms existing DDQN solutions. Finally, we discuss open challenges.
Authors: Zeynab Kaseb, Matthias Moller, Peter Palensky, Pedro P. Vergara
This paper proposes a novel combinatorial optimization framework that reformulates existing power system problems into a format executable on quantum annealers. The proposed framework accommodates both real and complex numbers and enables efficient handling of large-scale problems, thus ensuring broad applicability across power system problems. As a proof of concept, we demonstrate its applicability in two classical problems: (i) power system parameter identification, where we estimate the admittance matrix given voltage and current measurements, and (ii) power flow analysis, where we reformulate the nonlinear equations governing active and reactive power balance. The results show that the proposed framework effectively and efficiently solves both linear and nonlinear power system problems, and thus offers significant advantages in scenarios where traditional solvers face challenges, such as ill-conditioned systems and fault conditions.
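The basic encoding step can be sketched as a QUBO via binary expansion for a real-valued toy identification problem; the paper's framework additionally handles complex numbers and large-scale systems, so everything below is a hypothetical miniature.

```python
import numpy as np
from itertools import product

# Toy parameter identification: find integer x in {0,...,3}^2
# minimizing ||A x - b||^2, encoded as a QUBO y^T Q y over bits y.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
x_true = np.array([2, 1])
b = A @ x_true

# binary expansion x = P y, two bits per variable (weights 1 and 2)
P = np.kron(np.eye(2), np.array([1.0, 2.0]))   # shape (2, 4)

G = P.T @ A.T @ A @ P                 # quadratic part of ||APy - b||^2
lin = -2.0 * (P.T @ A.T @ b)          # linear part
Q = G.copy()
Q[np.diag_indices_from(Q)] += lin     # y_i^2 == y_i for binary y

# brute force stands in for the annealer on this 4-bit toy
best = min(product([0, 1], repeat=4),
           key=lambda y: np.array(y) @ Q @ np.array(y))
print(P @ np.array(best))             # recovers x_true = [2, 1]
```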
Authors: Pablo Ramírez-Espinosa, David Morales-Jiménez, Beatriz Soret
Motivated by the stringent and challenging need for `greener communications' in increasingly power-hungry 5G networks, this paper presents a detailed energy efficiency analysis for three different multi-antenna architectures, namely fully-digital arrays, hybrid arrays, and dynamic metasurface antennas (DMAs). By leveraging a circuital model, which captures mutual coupling, insertion losses, propagation through the waveguides in DMAs and other electromagnetic phenomena, we design a transmit Wiener filter solution for the three systems. We then use these results to analyze the energy efficiency, considering different consumption models and supplied power, and with particular focus on the impact of the physical phenomena. DMAs emerge as an efficient alternative to classical arrays across diverse tested scenarios, most notably under low transmission power, strong coupling, and scalability requirements.
Authors: Ahmed Aboudonia, Johannes Estermann, Keith Moffat, Manfred Morari, John Lygeros
We aim to improve the energy efficiency of train climate control architectures, with a focus on a specific class of regional trains operating throughout Switzerland, especially in Zurich and Geneva. Heating, Ventilation, and Air Conditioning (HVAC) systems represent the second largest energy consumer in these trains after traction. The current architecture comprises a high-level rule-based controller and a low-level tracking controller. To improve train energy efficiency, we propose adding a middle data-driven predictive control layer aimed at minimizing HVAC energy consumption while maintaining passenger comfort. The scheme incorporates a multistep prediction model developed using real-world data collected from a limited number of train coaches. To validate the effectiveness of the proposed architecture, we conduct multiple experiments on a separate set of train coaches; our results suggest energy savings between 10% and 35% with respect to the current architecture.
Authors: Sijia Geng, Thomas Lee, Dharik Mallapragada, Audun Botterud
Electrified transportation leads to a tighter integration between transportation and energy distribution systems. In this work, we develop scalable optimization models to co-design hydrogen and battery electric vehicle (EV) fleets, distributed energy resources, and fast-charging and hydrogen-fueling infrastructure to efficiently meet transportation demands. A novel integer-clustering formulation is used for optimizing fleet-level EV operation while maintaining accurate individual vehicle dispatch, which significantly improves the computation efficiency with guaranteed performance. We apply the optimization model to Boston's public transit bus network using real geospatial data and cost parameters. Realistic insights are provided into the future evolution of coupled electricity-transportation-hydrogen systems, including the effects of electricity price structure, hydrogen fuel cost, carbon emission constraint, temperature effects on EV range, and distribution system upgrade cost.
Authors: Wenxin Liu, Jiakun Fang, Shichang Cui, Iskandar Abdullaev, Suyang Zhou, Xiaomeng Ai, Jinyu Wen
The growing coupling among electricity, gas, and hydrogen systems is driven by the blending of green hydrogen into existing natural gas pipelines, paving the way toward a renewable-dominated energy future. However, this integration poses significant challenges, particularly in ensuring efficient and safe operation under varying hydrogen penetration and infrastructure adaptability. This paper reviews progress in optimization and control technologies for hydrogen-blended integrated gas-electricity systems. First, key technologies and international demonstration projects are introduced to provide an overview of current developments. Advances in gas-electricity system integration, including modeling, scheduling, planning, and market design, are then reviewed. Next, the potential for cross-system fault propagation is highlighted, and practical methods for safety analysis and control are proposed. Finally, several promising research directions are outlined, aiming to ensure efficient renewable integration and reliable operation.
Authors: Eric Tönges, Philipp Härtel, Martin Braun
An approach is proposed to identify optimal asset protection strategies based on vulnerability assessment outcomes. Traditional bilevel attacker-defender models emphasize worst-case scenarios but offer limited defensive guidance. In contrast, trilevel models introduce high computational complexity and rely on fixed network configurations. The proposed critical-components method leverages vulnerability assessment results to determine protection strategies, effectively outsourcing the upper-level defense decision. This enables adaptability to diverse network topologies, assessment techniques, and cyber-physical energy systems without the overhead of multi-level optimization. Case studies demonstrate the potential for improved system resilience across varying operational conditions.
Authors: Asad Mahmood, Thang X. Vu, Wali Ullah Khan, Symeon Chatzinotas, Björn Ottersten
This work proposes a framework for the robust design of UAV-assisted wireless networks that combines 3D trajectory optimization with user mobility prediction to address dynamic resource allocation challenges. We propose a sparse second-order prediction model for real-time user tracking, coupled with heuristic user clustering to balance service quality and computational complexity. The joint optimization problem is formulated to maximize the minimum rate and is then decomposed into user association, 3D trajectory design, and resource allocation subproblems, which are solved iteratively via successive convex approximation (SCA). Extensive simulations demonstrate: (1) near-optimal performance with $\epsilon \approx 0.67\%$ deviation from upper-bound solutions, (2) $16\%$ higher minimum rates for distant users compared to non-predictive 3D designs, and (3) $10$--$30\%$ faster outage mitigation than time-division benchmarks. The framework's adaptive speed control enables precise mobile user tracking while maintaining energy efficiency under constrained flight time. Results demonstrate superior robustness in edge-coverage scenarios, making the framework particularly suitable for 5G/6G networks.
Authors: Alex Pierron, Michel Barbeau, Luca De Cicco, Jose Rubio-Hernan, Joaquin Garcia-Alfaro
Reconfigurable Intelligent Surfaces (RISs) are composed of physical elements that can dynamically alter electromagnetic wave properties to enhance beamforming, leading to improvements in areas with poor coverage. They can be combined with Reinforcement Learning (RL) techniques to optimize network performance and energy efficiency. Beyond performance and energy improvements, it is also crucial to consider fair communications: RISs must ensure that User Equipment (UE) units receive their signals with adequate strength, without other UE being deprived of service due to insufficient power. In this paper, we address this problem. We explore the fairness properties of previous work and propose a novel method that aims to obtain an efficient and fair duplex RIS-RL system for multiple legitimate UE units. We report and discuss our experimental work and simulation results, and release our code and datasets to foster further research on the topic.
Authors: Xun Li, Qiong Wu, Pingyi Fan, Kezhi Wang, Nan Cheng, Khaled B. Letaief
Edge caching is an emerging technology that empowers caching units at edge nodes, allowing users to fetch contents of interest that have been pre-cached there. The key to pre-caching is maximizing the cache hit percentage for cached content without compromising users' privacy. In this letter, we propose a federated learning (FL)-assisted edge caching scheme based on a lightweight-architecture denoising diffusion probabilistic model (LDPM). Our simulation results verify that the proposed scheme achieves a higher cache hit percentage than existing FL-based methods and baseline methods.
Authors: Xinghao Zhu, Yuxin Chen, Lingfeng Sun, Farzad Niroui, Simon Le Cleac'h, Jiuguang Wang, Kuan Fang
The ability to flexibly leverage limbs for loco-manipulation is essential for enabling autonomous robots to operate in unstructured environments. Yet, prior work on loco-manipulation is often constrained to specific tasks or predetermined limb configurations. In this work, we present Reinforcement Learning for Interlimb Coordination (ReLIC), an approach that enables versatile loco-manipulation through flexible interlimb coordination. The key to our approach is an adaptive controller that seamlessly bridges the execution of manipulation motions and the generation of stable gaits based on task demands. Through the interplay between two controller modules, ReLIC dynamically assigns each limb for manipulation or locomotion and robustly coordinates them to achieve task success. Using efficient reinforcement learning in simulation, ReLIC learns to perform stable gaits in accordance with the manipulation goals in the real world. To solve diverse and complex tasks, we further propose to interface the learned controller with different types of task specifications, including target trajectories, contact points, and natural language instructions. Evaluated on 12 real-world tasks that require diverse and complex coordination patterns, ReLIC demonstrates its versatility and robustness by achieving a success rate of 78.9% on average. Videos and code can be found at this https URL.
Authors: Huy Truong-Ba, Jacky Chin, Michael E. Cholette, Pietro Borghesani
Track geometry monitoring is essential for maintaining the safety and efficiency of railway operations. While Track Recording Cars (TRCs) provide accurate measurements of track geometry indicators, their limited availability and high operational costs restrict frequent monitoring across large rail networks. Recent advancements in on-board sensor systems installed on in-service trains offer a cost-effective alternative by enabling high-frequency, albeit less accurate, data collection. This study proposes a method to enhance the reliability of track geometry predictions by integrating low-accuracy sensor signals with degradation models through a Kalman filter framework. An experimental campaign using a low-cost sensor system mounted on a TRC evaluates the proposed approach. The results demonstrate that incorporating frequent sensor data significantly reduces prediction uncertainty, even when the data is noisy. The study also investigates how the frequency of data recording influences the size of the credible prediction interval, providing guidance on the optimal deployment of on-board sensors for effective track monitoring and maintenance planning.
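The fusion idea can be prototyped with a textbook linear Kalman filter: a constant-rate degradation model supplies the prediction, and frequent but noisy on-board measurements supply the updates. The sketch below uses synthetic data and invented noise levels, not the study's models.

```python
import numpy as np

rng = np.random.default_rng(3)

# Truth: a track irregularity that degrades roughly linearly
T, rate = 200, 0.01
truth = 1.0 + rate * np.arange(T) + np.cumsum(rng.normal(0, 0.005, T))

# State [level, rate]; constant-rate degradation with process noise
F = np.array([[1.0, 1.0], [0.0, 1.0]])
Qn = np.diag([1e-4, 1e-6])
H = np.array([[1.0, 0.0]])          # the sensor sees the level only
Rn = np.array([[0.25]])             # low-accuracy on-board sensor

x, P = np.array([0.0, 0.0]), np.eye(2)
for t in range(T):
    # predict with the degradation model
    x = F @ x
    P = F @ P @ F.T + Qn
    # update with a frequent, noisy in-service measurement
    z = truth[t] + rng.normal(0, 0.5)
    S = H @ P @ H.T + Rn
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.array([z]) - H @ x)
    P = (np.eye(2) - K @ H) @ P

# final estimation error and credible-interval width shrink with data
print(round(abs(x[0] - truth[-1]), 3), round(P[0, 0], 4))
```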
Authors: Lorenzo Zapparoli, Alfredo Oneto, María Parajeles Herrera, Blazhe Gjorgiev, Gabriela Hug, Giovanni Sansavini
The decarbonization goals worldwide drive the energy transition of power distribution grids, which operate under increasingly volatile conditions and closer to their technical limits. In this context, localized operational data with high temporal and spatial resolution is essential for their effective planning and regulation. Nevertheless, information on grid-connected distributed energy resources, such as electric vehicles, photovoltaic systems, and heat pumps, is often fragmented, inconsistent, and unavailable. This work introduces a comprehensive database of distributed energy resources and non-controllable loads allocated in Switzerland's medium- and low-voltage distribution grid models, covering over 2 million points of connection. Remarkably, this data specifies the flexibility capabilities of the controllable devices, with a set of projections aligned with national forecasts for 2030, 2040, and 2050. The database supports studies on flexibility provision of distributed energy resources, distribution grid resilience, and national energy policy, among other topics. Importantly, its modular structure allows users to extract national- and local-scale information across medium- and low-voltage systems, enabling broad applicability across locations.
Authors: Masoud Behbahani, Alireza Fereidunian
This paper proposes using distributed generation (DG) to improve the Metro load curve. Public transportation systems are often based on gasoline and diesel. However, with the gradual growth of Metro and monorail usage, a new load with heavy demand, an inappropriate load curve, and a moderate load factor (LF) is added to the electricity grid. In addition to the supply problem posed by this massive consumer, the Metro load curve itself is problematic, with a relatively low LF. Furthermore, Metro load peaks coincide with the peaks of the national grid. Load curve improvement is well known in the electrical engineering literature, which, depending on the type of load curve, offers general recommendations under three approaches: demand-side management (DSM), distributed storage (DS), and DG. In this paper, to obtain quantitative indices of Metro load curve improvement using DG, the typical load curve is first extracted from an analysis of the volume and consumption patterns of the main Metro loads. Using this curve, the effect of DG is quantified by parameters that show a significant improvement in the load curve. These parameters can be used to calculate economic indicators such as initial cost and return on investment (ROI).
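The headline quantity is the load factor, LF = average load / peak load. The toy sketch below shows how dispatching DG at peak hours raises LF on an invented Metro-like load curve; all numbers are placeholders.

```python
import numpy as np

# Illustrative 24-hour Metro load curve [MW] with two rush-hour peaks
hours = np.arange(24)
load = 8 + 6*np.exp(-((hours - 8)**2)/4) + 7*np.exp(-((hours - 18)**2)/4)

def load_factor(curve):
    return curve.mean() / curve.max()

# DG dispatched during peak hours flattens the net curve seen by the grid
dg = np.where(load > 11, 4.0, 0.0)       # 4 MW whenever load exceeds 11 MW
net = load - dg

print(round(load_factor(load), 3), round(load_factor(net), 3))
```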
Authors: Hiya Gada, Rupamathi Jaddivada, Marija Ilic
The widespread deployment of power electronic-based technologies is transforming modern power systems into fast, nonlinear, and heterogeneous systems. Conventional modeling and control approaches, rooted in quasi-static analysis and centralized control, are inadequate for these converter-dominated systems, which operate on fast timescales and involve proprietary models of diverse components. This paper adopts and extends a previously introduced energy space modeling framework grounded in energy conservation principles to address these challenges. We generalize the notion of a port interaction variable, which encodes energy exchange between interconnected, heterogeneous components in a unified and physically intuitive manner. A multilayered distributed control architecture is proposed, wherein the nonlinear physical dynamics of each component are lifted to a higher-level linear energy space through well-defined mappings. Distributed controllers are designed in this energy space using only local states and minimal neighbor information via port interaction variables. Two control designs, energy-based feedback linearizing control (FBLC) and sliding mode control (SMC), are proven to achieve asymptotic convergence to reference outputs. The approach is validated on two systems: an inverter-controlled RLC circuit and a synchronous generator connected to a load. In both cases, energy-based control improves transient response and reduces control effort.
Authors: Yuhao Lian, Xiao Han, Xinmao Deng
With the commercial deployment of 5G and in-depth research on 6G, demand for high-speed data services in next-generation fiber-optic access systems is growing rapidly. Passive optical networks (PONs) have become a research hotspot due to their low loss, high bandwidth, and low cost. However, traditional orthogonal multiple access (OMA-PON) has difficulty meeting the next-generation PON requirements for high spectral efficiency and flexibility. In this paper, a novel transmission technology, power-domain sparse dimension constellation multiple access (PD-SDCMA), is proposed for the first time. Through a signal space dimension selection strategy (S2D-strategy) in the high-dimensional signal space, low-dimensional constellations are sparsely superimposed into the high-dimensional space, thereby reducing multi-user interference and enhancing system capacity. PD-SDCMA supports higher-order modulation formats and more access groups, and is compatible with the existing orthogonal frequency division multiplexing (OFDM) architecture. Simulation results show that in a 25 km single-mode fiber system, compared with PD-NOMA and 3D-NOMA, PD-SDCMA supports more users and significantly reduces the BER. This technology provides an efficient and low-cost solution for the evolution of flexible PONs.
Authors: Alex D. Hayes, Ryan J. Caverly
This paper presents an estimation and control framework that enables the targeted reentry of a drag-modulated spacecraft in the presence of atmospheric density uncertainty. In particular, an extended Kalman filter (EKF) is used to estimate the in-flight density errors relative to the atmospheric density used to generate the nominal guidance trajectory. This information is leveraged within a model predictive control (MPC) strategy to improve tracking performance, reduce control effort, and increase robustness to actuator saturation compared to the state-of-the-art approach. The estimation and control framework is tested in a Monte Carlo simulation campaign with historical space weather data. These simulation efforts demonstrate that the proposed framework is able to stay within 100 km of the guidance trajectory at all points in time for 98.4% of cases. The remaining 1.6% of cases were pushed away from the guidance by large density errors, many due to significant solar storms and flares, that could not physically be compensated for by the drag control device. For the successful cases, the proposed framework was able to guide the spacecraft to the desired location at the entry interface altitude with a mean error of 12.1 km and 99.7% of cases below 100 km.
Authors: Arash J. Khabbazi, Elias N. Pergantis, Levi D. Reyes Premer, Panagiotis Papageorgiou, Alex H. Lee, James E. Braun, Gregor P. Henze, Kevin J. Kircher
A large body of simulation research suggests that model predictive control (MPC) and reinforcement learning (RL) for heating, ventilation, and air-conditioning (HVAC) in residential and commercial buildings could reduce energy costs, pollutant emissions, and strain on power grids. Despite this potential, neither MPC nor RL has seen widespread industry adoption. Field demonstrations could accelerate MPC and RL adoption by providing real-world data that support the business case for deployment. Here we review 24 papers that document field demonstrations of MPC and RL in residential buildings and 80 in commercial buildings. After presenting demographic information -- such as experiment scopes, locations, and durations -- this paper analyzes experiment protocols and their influence on performance estimates. We find that 71% of the reviewed field demonstrations use experiment protocols that may lead to unreliable performance estimates. Over the remaining 29% that we view as reliable, the weighted-average cost savings, weighted by experiment duration, are 16% in residential buildings and 13% in commercial buildings. While these savings are potentially attractive, making the business case for MPC and RL also requires characterizing the costs of deployment, operation, and maintenance. Only 13 of the 104 reviewed papers report these costs or discuss related challenges. Based on these observations, we recommend directions for future field research, including: Improving experiment protocols; reporting deployment, operation, and maintenance costs; designing algorithms and instrumentation to reduce these costs; controlling HVAC equipment alongside other distributed energy resources; and pursuing emerging objectives such as peak shaving, arbitraging wholesale energy prices, and providing power grid reliability services.
Authors: Karl-Ludwig Besser, Rafael F. Schaefer, H. Vincent Poor
Resilience and power consumption are two important performance metrics for many modern communication systems, and it is therefore important to define, analyze, and optimize them. In this work, we consider a wireless communication system with secret-key generation, in which the secret-key bits are added to and used from a pool of available key bits. We propose novel physical layer resilience metrics for the survivability of such systems. In addition, we propose multiple power allocation schemes and analyze their trade-off between resilience and power consumption. In particular, we investigate and compare constant power allocation, an adaptive analytical algorithm, and a reinforcement learning-based solution. It is shown how the transmit power can be minimized such that a specified resilience is guaranteed. These results can be used directly by designers of such systems to optimize the system parameters for the desired performance in terms of reliability, security, and resilience.
Authors: Maurizio Clemente, Prapti Maharjan, Mauro Salazar, Theo Hofman
This paper investigates the environmental impact of Li-Ion batteries by quantifying manufacturing-related emissions and analyzing how electricity mix and production scale affect emission intensity. To this end, we conduct a meta-analysis of life cycle assessments on lithium-ion batteries published over the past two decades, categorizing them by year, battery chemistry, functional unit, system boundaries, and electricity mix. We then carry out a cradle-to-gate assessment for a nickel manganese cobalt 811 battery with a silicon-coated graphite anode, analyzing how variations in the carbon intensity of the electricity mix affect emissions, with case studies for China, South Korea, and Sweden. Finally, we develop a set of regression models that link annual battery production and the carbon intensity of China's electricity mix to the average mass-specific emissions observed each year. The meta-analysis shows a median global warming potential of 17.63 kg CO2-eq./kg of battery, with a standard deviation of 7.34. Differences in electricity mix mainly influence emissions from the energy-intensive cell production, particularly from cathode material processing. We found that a multivariate linear regression using production volume and the carbon intensity of the Chinese electricity mix as predictors explains emissions with moderate accuracy. The environmental impact of battery manufacturing can be reduced by using clean energy sources in production processes. However, achieving substantial reductions requires clean energy throughout the entire supply chain, as importing materials from regions with carbon-intensive electricity mixes can undermine these efforts. Our findings also highlight the emission-reducing effect of learning associated with increased production scale, supporting the integration of learning effects in future life cycle assessment models.
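A regression of this general shape is easy to reproduce. The sketch below fits mass-specific emissions on log production volume (a learning-curve proxy, our illustrative choice) and grid carbon intensity, using hypothetical data rather than the paper's dataset.

```python
import numpy as np

# Hypothetical yearly observations: production volume [GWh/yr],
# carbon intensity of the Chinese mix [gCO2/kWh], and mean
# mass-specific battery emissions [kg CO2-eq./kg]
volume = np.array([30, 60, 120, 250, 400, 650, 900], dtype=float)
intensity = np.array([720, 700, 680, 640, 610, 580, 560], dtype=float)
emissions = np.array([26.0, 24.5, 22.0, 19.5, 17.8, 16.2, 15.0])

# Multivariate linear regression with a log(volume) learning term
X = np.column_stack([np.ones_like(volume), np.log(volume), intensity])
coef, *_ = np.linalg.lstsq(X, emissions, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((emissions - pred)**2) / np.sum(
    (emissions - emissions.mean())**2)
print(np.round(coef, 3), round(r2, 3))
```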
Authors: Xiaoyi Yuan, Qiming Huang, Mingqing Guo, Huiming Ma, Ming Xu, Zeyi Liu, Xiao He
With the rapid advancement of intelligent technologies, collaborative frameworks integrating large and small models have emerged as a promising approach for enhancing industrial maintenance. However, several challenges persist, including limited domain adaptability, insufficient real-time performance and reliability, high integration complexity, and difficulties in knowledge representation and fusion. To address these issues, an intelligent maintenance framework for industrial scenarios is proposed. This framework adopts a five-layer architecture and integrates the precise computational capabilities of domain-specific small models with the cognitive reasoning, knowledge integration, and interactive functionalities of large language models. The objective is to achieve more accurate, intelligent, and efficient maintenance in industrial applications. Two realistic implementations, involving the maintenance of telecommunication equipment rooms and the intelligent servicing of energy storage power stations, demonstrate that the framework significantly enhances maintenance efficiency.
Authors: Darío Slaifstein (1), Gautham Ram Chandra Mouli (1), Laura Ramirez-Elizondo (1), Pavol Bauer (1) ((1) Delft University of Technology)
In the context of building electrification, operating distributed energy resources that integrate multiple energy carriers (electricity, heat, mobility) poses a significant challenge due to nonlinear device dynamics, uncertainty, and computational issues. Energy management systems therefore seek to choose set points for the primary control layer in the best way possible, minimizing and balancing operating costs (energy bills or asset degradation) against user requirements (mobility, heating, etc.). This paper presents a novel aging-aware day-ahead algorithm for electrified buildings. The proposed energy management algorithm incorporates physics-based battery aging models to enhance operational performance, making explicit the trade-off between grid cost and battery degradation. The algorithm can either cut down on grid costs or extend battery lifetime (of electric vehicle or stationary packs). Moreover, it exploits the differences between cathode chemistries, improving grid costs by 25\% when using LFP cells relative to NMC cells. Finally, performance with aged batteries is also enhanced relative to the benchmarks.
Authors: Cheng Guo, Harsha Nagarajan, Merve Bodur
The Alternating Current Optimal Transmission Switching (ACOTS) problem incorporates line switching decisions into the AC Optimal Power Flow (ACOPF) framework, offering well-known benefits in reducing operational costs and enhancing system reliability. ACOTS optimization models contain discrete variables and nonlinear, non-convex constraints, which make it difficult to solve. In this work, we develop strengthened quadratic convex (QC) relaxations for ACOTS, where we tighten the relaxation with several new valid inequalities, including a novel kind of on/off cycle-based polynomial constraints by taking advantage of the network structure. We linearize the sum of on/off trilinear terms in the relaxation using extreme-point representation, demonstrating theoretical tightness, and efficiently incorporate on/off cycle-based polynomial constraints through disjunctive programming-based cutting planes. Combined with an optimization-based bound tightening algorithm, this results in the tightest QC-based ACOTS relaxation to date. We additionally propose a novel maximum spanning tree-based heuristic to improve the computational performance by fixing certain lines to be switched on. Our extensive numerical experiments on medium-scale PGLib instances show significant improvements on relaxation bounds, while tests on large-scale instances with up to 2,312 buses demonstrate substantial performance gains. To our knowledge, this is the first ACOTS relaxation-based approach to demonstrate near-optimal switching solutions on realistic large-scale power grid instances.
Authors: Adrien Petralia, Philippe Charpentier, Youssef Kadhi, Themis Palpanas
Millions of smart meters have been deployed worldwide, collecting the total power consumed by individual households. Based on these data, electricity suppliers offer their clients energy monitoring solutions to provide feedback on the consumption of their individual appliances. Historically, such estimates have relied on statistical methods that use coarse-grained total monthly consumption and static customer data, such as appliance ownership. Non-Intrusive Load Monitoring (NILM) is the problem of disaggregating a household's collected total power consumption to retrieve the consumed power for individual appliances. Current state-of-the-art (SotA) solutions for NILM are based on deep-learning (DL) and operate on subsequences of an entire household consumption reading. However, the non-stationary nature of real-world smart meter data leads to a drift in the data distribution within each segmented window, which significantly affects model performance. This paper introduces NILMFormer, a Transformer-based architecture that incorporates a new subsequence stationarization/de-stationarization scheme to mitigate the distribution drift and that uses a novel positional encoding that relies only on the subsequence's timestamp information. Experiments with 4 real-world datasets show that NILMFormer significantly outperforms the SotA approaches. Our solution has been deployed as the backbone algorithm for EDF's (Electricité De France) consumption monitoring service, delivering detailed insights to millions of customers about their individual appliances' power consumption. This paper appeared in KDD 2025.
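The stationarization/de-stationarization idea, in its simplest per-subsequence form, can be sketched as follows; the constants and the stand-in "model" are placeholders, and NILMFormer's actual scheme is more elaborate.

```python
import numpy as np

def stationarize(window, eps=1e-8):
    """Remove the per-subsequence mean and scale, so the model sees
    a distribution-stable input despite drift across windows."""
    mu, sigma = window.mean(), window.std() + eps
    return (window - mu) / sigma, (mu, sigma)

def destationarize(output, stats):
    """Re-inject the statistics to express the output in watts."""
    mu, sigma = stats
    return output * sigma + mu

rng = np.random.default_rng(4)
power = 200 + 50*rng.random(256) + np.linspace(0, 80, 256)  # drifting load
z, stats = stationarize(power)
model_out = z * 0.4          # stand-in for the network's disaggregation
appliance = destationarize(model_out, stats)
print(round(z.mean(), 6), round(z.std(), 3))   # ~0 and ~1 by design
```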
Authors: Guozhen Zhu, Yuqian Hu, Chenshu Wu, Wei-Hsiang Wang, Beibei Wang, K. J. Ray Liu
WiFi-based home monitoring has emerged as a compelling alternative to traditional camera- and sensor-based solutions, offering wide coverage with minimal intrusion by leveraging existing wireless infrastructure. This paper presents key insights and lessons learned from developing and deploying a large-scale WiFi sensing solution, currently operational across over 10 million commodity off-the-shelf routers and 100 million smart bulbs worldwide. Through this extensive deployment, we identify four real-world challenges that hinder the practical adoption of prior research: 1) Non-human movements (e.g., pets) frequently trigger false positives; 2) Low-cost WiFi chipsets and heterogeneous hardware introduce inconsistencies in channel state information (CSI) measurements; 3) Motion interference in multi-user environments complicates occupant differentiation; 4) Computational constraints on edge devices and limited cloud transmission impede real-time processing. To address these challenges, we present a practical and scalable system, validated through comprehensive two-year evaluations involving 280 edge devices, across 16 scenarios, and over 4 million motion samples. Our solutions achieve an accuracy of 92.61% in diverse real-world homes while reducing false alarms due to non-human movements from 63.1% to 8.4% and lowering CSI transmission overhead by 99.72%. Notably, our system integrates sensing and communication, supporting simultaneous WiFi sensing and data transmission over home WiFi networks. While focused on home monitoring, our findings and strategies generalize to various WiFi sensing applications. By bridging the gaps between theoretical research and commercial deployment, this work offers practical insights for scaling WiFi sensing in real-world environments.
Authors: Junyi Duan, Jiageng Chen, Zuyuan He
Distributed fiber-optic acoustic sensing (DAS) has emerged as a transformative approach for distributed vibration measurement with high spatial resolution and long measurement range while maintaining cost-efficiency. However, the two-dimensional spatial-temporal DAS signals present analytical challenges. The abstract signal morphology lacking intuitive physical correspondence complicates human interpretation, and its unique spatial-temporal coupling renders conventional image processing methods suboptimal. This study investigates spatial-temporal characteristics and proposes a self-supervised pre-training framework that learns signals' representations through a mask-reconstruction task. This framework is named the DAS Masked AutoEncoder (DAS-MAE). The DAS-MAE learns high-level representations (e.g., event class) without using labels. It achieves up to 1% error and 64.5% relative improvement (RI) over the semi-supervised baseline in few-shot classification tasks. In a practical external damage prevention application, DAS-MAE attains a 5.0% recognition error, marking a 75.7% RI over supervised training from scratch. These results demonstrate the high-performance and universal representations learned by the DAS-MAE framework, highlighting its potential as a foundation model for analyzing massive unlabeled DAS signals.
Authors: Marta Vanin, Frederik Geth, Rahmat Heidari, Dirk Van Hertem
The impedances of cables and lines used in (multi-conductor) distribution networks are usually unknown or approximated, and may lead to problematic results for any physics-based power system calculation, e.g., (optimal) power flow. Learning parameters from time series data is one of the few available options to obtain improved impedance models. This paper presents an approach that combines statistical learning concepts with the exploitation of domain knowledge, in the form of Carson's equations, through nonlinear mathematical optimization. The proposed approach derives impedance matrices for up-to-four-wire systems, using measurement data like those obtained from smart meters. Despite the lack of phasor measurements, the low signal-to-noise ratio of smart meter measurements, and the inherent existence of multiple equivalent solutions, our method produces good quality impedance models that are fit for power system calculations, significantly improving on our previous work both in terms of accuracy and computational time.
Authors: Darren Leniston, David Ryan, Ammar Malik, Jack Jackman, Terence O'Donnell
As distributed energy resources (DERs) such as solar PV, batteries and electric vehicles become increasingly prevalent at the edge, maintaining grid stability requires advanced monitoring and control mechanisms. This paper presents a scalable smart grid gateway architecture that enables interoperability between Modbus-based inverters and IEEE 2030.5 cloud-based control systems. The proposed solution leverages Azure cloud services and edge-computing gateway devices to support dynamic configuration, telemetry ingestion, remote control and Volt-VAR Curve deployment. A microservice-based architecture ensures flexibility and scalability across diverse deployment scenarios, including both gateway-mediated and direct-to-cloud device communication. Results demonstrate the successful mapping of a Fronius Primo inverter's Modbus registers to IEEE 2030.5-compliant telemetry and control functions. Additionally, we evaluate real-time VVC updates and their impact on local voltage regulation, showcasing dynamic cloud-to-edge control with minimal latency. This work highlights the potential of virtualised, standards-based control infrastructures to support DER integration and active grid participation, while remaining adaptable to evolving smart grid architectures.
Authors: Philipp Härtel, Michael von Bonin
Electric vehicle (EV) fleets are expected to become an increasingly important source of flexibility for power system operations. However, accurately capturing the flexibility potential of numerous and heterogeneous EVs remains a significant challenge. We propose a bilevel optimization formulation to enhance flexibility aggregations of electric vehicle fleets. The outer level minimizes scheduling deviations between the aggregated and reference EV units, while the inner level maximizes the aggregated unit's profits. Our approach introduces hourly to daily scaling factor mappings to parameterize the aggregated EV units. Compared to simple aggregation methods, the proposed framework reduces the root-mean-square error of charging power by 78 percent, providing more accurate flexibility representations. The proposed framework also provides a foundation for several potential extensions in future work.
Authors: Luiza Ribeiro, Alexandre Street, Jose Manuel Arroyo, Rodrigo Moreno
The increasing vulnerability of power systems has heightened the need for operating reserves to manage contingencies such as generator outages, line failures, and sudden load variations. Unlike energy costs, driven by consumer demand, operating reserve costs arise from addressing the most critical credible contingencies - prompting the question: how should these costs be allocated through efficient pricing mechanisms? As an alternative to previously reported schemes, this paper presents a new causation-based pricing framework for electricity markets based on contingency-constrained energy and reserve scheduling models. Major salient features include a novel security charge mechanism along with the explicit definition of prices for up-spinning reserves, down-spinning reserves, and transmission services. These features ensure more comprehensive and efficient cost-reflective market operations. Moreover, the proposed nodal pricing scheme yields revenue adequacy and neutrality while promoting reliability incentives for generators based on the cost-causation principle. An additional salient aspect of the proposed framework is the economic incentive for transmission assets, which are remunerated based on their use to deliver energy and reserves across all contingency states. Numerical results from two case studies illustrate the performance of the proposed pricing scheme.
Authors: Sonia Martin, Obidike Nnorom Jr., Philip Levis, Ram Rajagopal
Residential electric vehicle charging causes large spikes in electricity demand that risk violating neighborhood transformer power limits. Battery energy storage systems reduce these transformer limit violations, but operating them individually is not cost-optimal. Instead of individual optimization, aggregating, or sharing, these batteries leads to cost-optimal performance, but homeowners must relinquish battery control. This paper leverages virtualization to propose battery sharing optimization schemes to reduce electricity costs, extend the lifetime of a residential transformer, and maintain homeowner control over the battery. A case study with simulated home loads, solar generation, and electric vehicle charging profiles demonstrates that joint, or shared, optimization reduces consumer bills by 56% and transformer aging by 48% compared to individual optimization. Hybrid and dynamic optimization schemes that provide owners with autonomy have similar transformer aging reduction but are slightly less cost-effective. These results suggest that controlling shared batteries with virtualization is an effective way to delay transformer upgrades in the face of growing residential electric vehicle charging penetration.
Authors: Robert Bayer, Julian Priest, Daniel Kjellberg, Jeppe Lindhard, Nikolaj Sørenesen, Nicolaj Valsted, Ívar Óli, Pınar Tözün
CubeSats offer a low-cost platform for space research, particularly for Earth observation. However, their resource-constrained nature and operation in space constrain the flexibility and complexity of the deployed image processing pipelines and their orchestration. This paper introduces two novel systems, DIPP and DISH, to address these challenges. DIPP is a modular and configurable image processing pipeline framework that allows for adaptability to changing mission goals even after deployment, while preserving robustness. DISH is a domain-specific language (DSL) and runtime system designed to schedule complex imaging workloads on low-power and memory-constrained processors. Our experiments demonstrate that DIPP's decomposition of the processing pipelines adds negligible overhead, while significantly reducing the network requirements of updating pipelines and remaining robust against erroneous module uploads. Furthermore, we compare DISH to Lua, a general-purpose scripting language, and demonstrate its comparable expressiveness and lower memory requirement.
Authors: Emmanuel O. Badmus, Amritanshu Pandey
This paper presents a \textit{physics-based} steady-state equivalent circuit model of a two-stage bidirectional inverter. These inverters connect distributed energy resources (DERs), such as photovoltaic (PV) and battery systems, to distribution grids. Existing inverter models have technical gaps on three fronts: i) inadequate modeling of inverter losses, ii) use of mathematical abstractions for bidirectional flow of power, and iii) inability to integrate different control modes into nonlinear solvers without loss of generality. We propose a physics-first model that explicitly captures losses in passive circuit components based on circuit-level principles. We enable bidirectional power flow without binary or complementarity constraints by formulating loss terms as smooth, sign-aware expressions of current. We introduce and parameterize controlled current sources with twice-differentiable continuous functions to enable inverter control modes without loss of generality. We integrate DERs with the proposed inverter model at the load buses of distribution networks to perform power flow and optimization studies on real-world distribution networks with over 20,000 nodes. We demonstrate that the proposed model is more accurate, integrates seamlessly with various control modes without loss of generality, and scales robustly to large optimization problems. Index Terms: bidirectional inverter model, circuit-based modeling, DERs, inverter efficiency, power control, steady-state analysis.
Authors: Muratkhan Abdirash, Xiaofan Cui
A DC microgrid is a promising alternative to the traditional AC power grid, since it can efficiently integrate distributed and renewable energy resources. However, as an emerging framework, it lacks the rigorous theoretical guarantees of its AC counterpart. In particular, safe stabilization of a DC microgrid has been a non-trivial task in power electronics. To address this, we take a control-theoretic perspective in designing a feedback controller with provable guarantees. We present a systematic way to construct Control Lyapunov Functions (CLF) to stabilize the microgrid and, independently, Control Barrier Functions (CBF) to enforce its safe operation at all times. The safety-critical controller (SCC) proposed in this work integrates the two control objectives, with safety prioritized, as linear constraints in a quadratic program (QP), which allows for online deployment using off-the-shelf convex optimization solvers. The SCC is compared against a robust version of conventional droop control through numerical experiments, whose results indicate that the SCC outperforms the droop controller in guaranteeing safety while retaining stability.
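The CLF-CBF quadratic program admits a compact sketch on a deliberately simplified scalar bus-voltage model; the gains, limits, and dynamics below are invented for illustration, and the slack variable encodes the safety-first prioritization.

```python
import cvxpy as cp

# Simplified DC-bus voltage model: v_dot = u (control authority only).
vref, vmin, vmax = 48.0, 44.0, 52.0
c, k, p = 2.0, 5.0, 100.0     # CLF rate, CBF rate, slack penalty

def scc_step(v):
    u = cp.Variable()
    d = cp.Variable(nonneg=True)          # CLF slack: safety has priority
    # CLF V = (v - vref)^2 must decrease, up to the slack d
    clf = 2*(v - vref)*u <= -c*(v - vref)**2 + d
    # CBFs keep h1 = v - vmin and h2 = vmax - v nonnegative (hard)
    cbf_lo = u >= -k*(v - vmin)
    cbf_hi = -u >= -k*(vmax - v)
    cp.Problem(cp.Minimize(cp.square(u) + p*cp.square(d)),
               [clf, cbf_lo, cbf_hi]).solve()
    return u.value

v, dt = 51.5, 0.01
for _ in range(300):
    v += dt * scc_step(v)
print(round(v, 3))   # settles near vref while respecting the barriers
```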
Authors: Shiva Moshtagh, Behrouz Azimian, Mohammad Golgol, Anamitra Pal
Traditional optimization-based techniques for time-synchronized state estimation (SE) often suffer from a high online computational burden, limited phasor measurement unit (PMU) coverage, and the presence of non-Gaussian measurement noise. Although conventional learning-based models have been developed to overcome these challenges, they are negatively impacted by topology changes and real-time data loss. This paper proposes a novel deep geometric learning approach based on graph neural networks (GNNs) to estimate the states of PMU-unobservable power systems. The proposed approach combines graph convolution and multi-head graph attention layers inside a customized end-to-end learning framework to handle topology changes and real-time data loss. An upper bound on the SE error as a function of topology change is also derived. Experimental results for different test systems demonstrate the superiority of the proposed customized GNN-SE (CGNN-SE) over traditional optimization-based techniques as well as conventional learning-based models in the presence of topology changes, PMU failures, bad data, non-Gaussian measurement noise, and large system implementation.
Authors: Hang Fan, Mingxuan Li, Jingshi Cui, Zuhan Zhang, Wencai Run, Dunnan Liu
The rapid growth of electric vehicles (EVs) and the subsequent increase in charging demand pose significant challenges for grid load scheduling and the operation of EV charging stations. Effectively harnessing the spatiotemporal correlations among EV charging stations to improve forecasting accuracy is complex. To tackle these challenges, we propose EV-LLM, a forecasting framework for EV charging loads based on large language models (LLMs). EV-LLM integrates the strengths of Graph Convolutional Networks (GCNs) in spatiotemporal feature extraction with the generalization capabilities of fine-tuned generative LLMs. EV-LLM also enables effective data mining and feature extraction across multimodal and multidimensional datasets, incorporating historical charging data, weather information, and relevant textual descriptions to enhance forecasting accuracy for multiple charging stations. We validate the effectiveness of EV-LLM using charging data from 10 stations in California, demonstrating its superiority over traditional deep learning methods and its potential to optimize grid load scheduling and support vehicle-to-grid interactions.
Authors: Anton Hinneck, David Pozo
The optimal transmission switching problem (OTSP) is an established problem of changing a power grid's topology to obtain improved operation by controlling the switching status of transmission lines. The problem has been proven to be NP-hard. Solution techniques based on mixed-integer formulations can guarantee globally optimal solutions but are potentially intractable for realistic power grids. Heuristic methods cannot guarantee global optimality but provide tractable solution approaches. This paper proposes solving the OTSP using exact formulations alongside parallel heuristics that generate good candidate solutions to speed up conventional branch-and-bound algorithms. The innovative aspect of this work is a new asynchronous parallel algorithmic architecture: a solver instance solving the full OTSP formulation runs in parallel with another process that asynchronously generates solutions to be injected into the full OTSP solution procedure at run time. Our method is tested on 14 instances of the pglib-opf library, the largest consisting of 13,659 buses and 20,467 branches. Our results show good performance on large problem instances, with consistent improvements over off-the-shelf solver performance, and the method scales well with the number of parallel processors.
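As a rough illustration of the asynchronous architecture described above, the sketch below runs a hypothetical heuristic in a separate process and feeds its candidates to the main loop through a queue; both the heuristic body and the solver loop are placeholders, since a real implementation would register each candidate as a MIP start inside branch-and-bound.

```python
import multiprocessing as mp
import queue
import random
import time

def heuristic_worker(sol_queue):
    """Hypothetical heuristic: keeps emitting candidate switching vectors."""
    while True:
        candidate = [random.randint(0, 1) for _ in range(10)]  # line on/off statuses
        sol_queue.put(candidate)              # asynchronous injection point
        time.sleep(0.1)

def exact_solver(sol_queue, runtime=2.0):
    """Placeholder for the branch-and-bound loop solving the full OTSP."""
    deadline = time.time() + runtime
    while time.time() < deadline:
        try:
            warm_start = sol_queue.get_nowait()
            # A real implementation would register warm_start as a MIP start
            # with the solver here, tightening the incumbent during run time.
        except queue.Empty:
            pass

if __name__ == "__main__":
    q = mp.Queue()
    mp.Process(target=heuristic_worker, args=(q,), daemon=True).start()
    exact_solver(q)
```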
Authors: Jose Leopoldo Contreras, Ola Shorinwa, Mac Schwager
We present SODA-MPC, a Safe, Out-of-Distribution-Adaptive Model Predictive Control algorithm, which uses an ensemble of learned models for prediction, with a runtime monitor to flag unreliable out-of-distribution (OOD) predictions. When an OOD situation is detected, SODA-MPC triggers a safe fallback control strategy based on reachability, yielding a control framework that achieves the high performance of learning-based models while preserving the safety of reachability-based control. We demonstrate the method in the context of an autonomous vehicle, driving among dynamic pedestrians, where SODA-MPC uses a neural network ensemble for pedestrian prediction. We calibrate the OOD signal using conformal prediction to derive an OOD detector with probabilistic guarantees on the false-positive rate, given a user-specified confidence level. During in-distribution operation, the MPC controller avoids collisions with a pedestrian based on the trajectory predicted by the mean of the ensemble. When OOD conditions are detected, the MPC switches to a reachability-based controller to avoid collisions with the reachable set of the pedestrian assuming a maximum pedestrian speed, to guarantee safety under the worst-case actions of the pedestrian. We verify SODA-MPC in extensive autonomous driving simulations in a pedestrian-crossing scenario. Our model ensemble is trained and calibrated with real pedestrian data, showing that our OOD detector obtains the desired accuracy rate within a theoretically-predicted range. We empirically show improved safety and improved task completion compared with two state-of-the-art MPC methods that also use conformal prediction, but without OOD adaptation. Further, we demonstrate the effectiveness of our method with the large-scale multi-agent predictor Trajectron++, using large-scale traffic data from the nuScenes dataset for training and calibration.
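A minimal sketch of the conformal calibration step for such an OOD monitor is shown below, assuming a generic scalar nonconformity score (e.g., ensemble disagreement); the score and data are synthetic stand-ins, not SODA-MPC's actual signal.

```python
import numpy as np

def conformal_threshold(calib_scores, alpha=0.05):
    """Split-conformal threshold: P(score > tau) <= alpha for in-distribution data."""
    n = len(calib_scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))    # finite-sample adjustment
    return np.sort(calib_scores)[min(k, n) - 1]

rng = np.random.default_rng(0)
calib = rng.normal(1.0, 0.2, size=500)           # e.g., ensemble disagreement scores
tau = conformal_threshold(calib, alpha=0.05)
is_ood = lambda score: score > tau               # runtime OOD flag
print(is_ood(2.5), is_ood(1.0))
```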
Authors: Niloofar Pourghaderi, Milad Kabirifar, Payman Dehghanian
Local electricity markets offer a promising solution for integrating renewable energy sources and other distributed energy resources (DERs) into distribution networks. These markets enable the effective utilization of flexible resources by facilitating coordination among various agents. Beyond technical and economic considerations, addressing social equity within these local communities is critical and requires dedicated attention in market-clearing frameworks. This paper proposes a social equity-based market-clearing framework for the optimal management of DERs' energy and flexibility within local communities. The proposed framework incorporates consumers' energy burden to ensure fair pricing in energy market clearance. Furthermore, to ensure equity during unbalanced operating conditions, flexible resources are managed in the local flexibility market, ensuring that all participants can trade power fairly under network disturbances. The model is formulated as a second-order cone programming (SOCP) optimization and validated on the IEEE 33-bus test distribution network.
Authors: Noah Rhodes, James Luedkte, Line Roald
Optimization problems that involve topology optimization in scenarios with large-scale outages, such as post-disaster restoration or public safety power shutoff planning, are very challenging to solve. Using simple power flow representations, such as DC power flow or network flow models, results in low-quality solutions that require significantly higher-than-predicted load shed to become AC feasible. Recent work has shown that formulations based on the Second Order Cone (SOC) power flow formulation find very high-quality solutions with low load shed, but the computational burden of these formulations remains a significant challenge. With the aim of reducing computational time while maintaining high solution quality, this work explores formulations that replace the conic constraints with a small number of linear cuts. The goal of this approach is not to find an exact power flow solution, but rather to identify good binary decisions, where the power flow can be resolved after the binary variables are fixed. We find that a simple reformulation of the Second Order Cone Optimal Power Shutoff problem can greatly improve solution speed, but that a full linearization of the SOC voltage cone equation results in an overestimation of the amount of power that can be delivered to loads.
Authors: Junyi Tao, Ran Li, Salvador Pineda
Time-adaptive unit commitment (UC) has recently been investigated to reduce scheduling costs by flexibly varying the temporal resolution, which is usually determined by clustering the net load patterns. However, there exists a misalignment between cost and net load patterns due to discrete start-up costs and out-of-merit-order dispatch triggered by ramping and other constraints, so the optimal time-adaptive resolution cannot be completely captured by clustering-based methods. This paper proposes a cost-oriented method that addresses this misalignment through a novel bilevel optimization approach, solved efficiently by a heuristic greedy algorithm. The impact of varying temporal resolution on the final scheduling costs is tested, based on which the temporal resolution is heuristically updated, achieving significant cost reduction without increasing the number of temporal periods. Subsequently, an improved discretized Adam optimization method, together with an offline warm-start and online refinement strategy, is proposed to efficiently search for a better temporal resolution configuration. Results show that the proposed cost-oriented UC temporal resolution determination method achieves enhanced cost efficiency.
Authors: Yixiang Huang, Jianhua Pei, Luocheng Chen, Zhenchang Du, Jinfu Chen, Zirui Peng
The proliferation of intermittent distributed renewable energy sources (RES) in modern power systems has fundamentally compromised the reliability and accuracy of deterministic net load forecasting. Generative models, particularly diffusion models, demonstrate exceptional potential in uncertainty quantification for scenario forecasting. Nevertheless, their probabilistic predictive capabilities and conditional bootstrapping mechanisms remain underexplored. In this paper, a day-ahead probabilistic net load forecasting framework is developed by systematically quantifying epistemic uncertainty and aleatoric variability using a feature-informed enhanced conditional diffusion model (ECDM). The ECDM architecture implements the net load distribution generation process using an imputation-based conditional diffusion model, where multi-modal conditional inputs, such as weather and calendar data, are fused via cross-attention mechanisms. Specifically, historical net load profiles are utilized to guide the reverse diffusion trajectory through non-parametric imputation operators that preserve spatial-temporal integrity. To capture periodic characteristics, a novel weekly arrangement method is also introduced, while an unconditional model is integrated to ensure diversity in the generated scenarios. Subsequently, the maximum-probability points and probability intervals of the predicted net load are obtained by adaptive kernel density estimation under RES intermittency. Moreover, the ECDM is extended to a multi-energy forecasting framework in an attempt to increase the interpretability of the net load predictions. Numerical experiments on a publicly available dataset demonstrate the superior forecasting performance of the proposed method compared to existing state-of-the-art approaches.
Authors: Chu Han, Bingchao Zhao, Jiatai Lin, Shanshan Lyu, Longfei Wang, Tianpeng Deng, Cheng Lu, Changhong Liang, Hannah Y. Wen, Xiaojing Guo, Zhenwei Shi, Zaiyi Liu
Despite impressive performance across a wide range of applications, current computational pathology (CPath) models face significant diagnostic efficiency challenges due to their reliance on high-magnification whole-slide image analysis. This limitation severely compromises their clinical utility, especially in time-sensitive diagnostic scenarios and situations requiring efficient data transfer. To address these issues, we present a novel computation- and communication-efficient framework called Magnification-Aligned Global-Local Transformer (MAG-GLTrans). Our approach significantly reduces computational time, file transfer requirements, and storage overhead by enabling effective analysis using low-magnification inputs rather than high-magnification ones. The key innovation lies in our proposed magnification alignment (MAG) mechanism, which employs self-supervised learning to bridge the information gap between low and high magnification levels by effectively aligning their feature representations. Through extensive evaluation across various fundamental CPath tasks, MAG-GLTrans demonstrates state-of-the-art classification performance while achieving remarkable efficiency gains: up to 10.7 times reduction in computational time and over 20 times reduction in file transfer and storage requirements. Furthermore, we highlight the versatility of our MAG framework through two significant extensions: (1) its applicability as a feature extractor to enhance the efficiency of any CPath architecture, and (2) its compatibility with existing foundation models and histopathology-specific encoders, enabling them to process low-magnification inputs with minimal information loss. These advancements position MAG-GLTrans as a particularly promising solution for time-sensitive applications, especially in the context of intraoperative frozen section diagnosis where both accuracy and efficiency are paramount.
Authors: Tuan Anh Le, Ivan Ku, Xin-She Yang, Christos Masouros, Tho Le-Ngoc
This paper introduces a dual-function radar-communication (DFRC) system with cognitive radio capability to tackle the spectral scarcity problem in wireless communications. Particularly, a cognitive DFRC system operates on a spectrum owned by a primary system to simultaneously perform data communication and target tracking with the condition that its interference to the primary users (PUs) is below a certain threshold. To achieve this, an optimization problem is formulated to jointly design the beamforming vectors for both the radar and communication functions in such a way that the mean square error (MSE) of the beam pattern between the designed and desired waveforms is minimized. The optimization problem has the following three constraints: i) the signal-to-interference-plus-noise ratio (SINR) at each data communication user is above a predetermined level; ii) the per-antenna transmit power is maintained at a given level; iii) the interference imposed on each PU is below a certain threshold. Both the semidefinite relaxation and nature-inspired firefly algorithms are proposed in order to search for the optimal solutions to the optimization problem. The simulation results indicate that our proposed algorithms can enable the DFRC system to protect the PUs while simultaneously performing its communication and radar functions.
Authors: Yang Zhao, Yue Xiu, Chengxiao Dai, Ning Wei, Dusit Niyato
Large language model (LLM) training in 6G networks faces stringent latency and energy constraints while operating over bandwidth-limited wireless links. A commonly adopted workflow separates training into a centralized pre-training phase and a federated fine-tuning phase on domain-specific data; however, over-the-air (OTA) gradient aggregation during fine-tuning remains vulnerable to fading and interference. This study explores the integration of movable antennas (MAs), whose element positions can be reconfigured in real time, to mitigate such channel impairments. An auxiliary channel representation embeds transmit power terms in the effective gain, thereby removing explicit power-control variables. We derive a convergence bound that relates the final fine-tuning loss to OTA noise and to the distribution shift between the two data stages, measured via the Wasserstein distance. These findings lead to a mixed-integer, nonconvex resource allocation problem that jointly determines the number of global rounds, CPU frequencies, mini-batch sizes, positions of the MAs, and beamformers under latency-energy constraints. We propose a hybrid successive convex approximation (SCA) and penalty dual decomposition (PDD) algorithm to solve the problem. Experiments with the OpenLLaMA v2 model with 3 billion parameters demonstrate up to $95\%$ faster convergence and over $90\%$ lower total energy consumption relative to leading wireless federated learning baselines, underscoring the promise of MA-assisted federated LLM fine-tuning for 6G edge intelligence.
Authors: Na Xue, Xidong Mu, Yue Chen, Yuanwei Liu
A simultaneously transmitting and reflecting surface (STARS) assisted near-field (NF) integrated sensing and communication (ISAC) framework is proposed, where radio sensors are installed on the STARS to directly conduct distance-domain sensing by exploiting the characteristic spherical wavefront. A new squared position error bound (SPEB) expression is derived to reveal its dependence on the beamforming (BF) design and the sensor deployment. To balance the trade-off between the SPEB and the sensor deployment cost, a cost function minimization problem is formulated to jointly optimize the sensor deployment and the active and passive BF, subject to communication and power consumption constraints. For the sensor deployment optimization, a joint sensor deployment algorithm is proposed by invoking successive convex approximation. Under a specific relationship between the sensor numbers and the BF design, we derive the optimal sensor interval in closed form. For the joint BF optimization, a penalty-based method is invoked. Simulation results validate that the derived SPEB expression is close to the exact SPEB, revealing that the Fisher information matrix of position estimation in the NF can be approximated as a diagonal matrix. Furthermore, the proposed algorithms achieve better SPEB performance than the benchmark schemes while incurring the lowest deployment cost.
Authors: Na Xue, Xidong Mu, Yue Chen, Yuanwei Liu
A novel uniform circular array (UCA) based near-field (NF) integrated sensing and communication (ISAC) framework is proposed, where cylindrical coordinates are invoked to evaluate the joint positioning performance. The joint squared position error bound (SPEB) of the sensing target (ST) is derived for the coplanar and non-coplanar cases. For the coplanar case, where the ST is located in the coplanar region of the UCA, approximate Cramér-Rao bound (CRB) expressions for the separate angle and distance estimation are given by exploiting the uniform spherical wavefront model. An SPEB minimization problem is formulated with communication requirement and power budget constraints, where the closed-form solution minimizing the CRB of the angle is derived. Inspired by this closed-form expression, a low-complexity vector-based quadratic transformation (VQF) algorithm is proposed by invoking the Rayleigh quotient. For the non-coplanar case, where the ST is located beyond the coplanar region of the UCA, the separate CRBs over three-dimensional coordinates and the joint SPEB approximations are derived. To minimize the SPEB, the semi-definite relaxation (SDR) method and an extended low-complexity VQF algorithm are proposed. Numerical results validate that i) the Fisher information matrix for angle and distance in NF propagation can be approximated as a diagonal matrix with negligible loss; ii) compared with the uniform planar array, the UCA achieves better positioning performance when the ST is located in the coplanar region of the antenna array; and iii) the proposed VQF algorithms reach higher solution precision than the conventional SDR algorithm with much lower computational complexity.
Authors: Mahmoud Elgenedy
The complexity of neural networks is increasing rapidly due to the massive growth in model parameters. In Large Language Models (LLMs) in particular, the number of parameters has grown exponentially in the past few years, for example, from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3. This raises a significant challenge for implementation, especially on edge devices where memory and processing power are very limited. In this work, we investigate reducing LLM complexity with a special type of quantization, power-of-two (PoT), applied to linear-layer weights and transformer tables. PoT not only provides memory reduction but, more importantly, significant computational reduction by converting multiplications into bit shifts. We obtained preliminary results of PoT quantization on a Nano-GPT implementation using the Shakespeare dataset, and then extended the results to the 124M-parameter GPT-2 model. The PoT quantization results are very promising, with cross-entropy loss degradation of $\approx$[1.3-0.88] for bit widths in the range [4-6] used to represent the power levels.
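A minimal sketch of PoT weight quantization is given below; the bit width, exponent range, and rounding rule are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def pot_quantize(w, n_bits=4):
    """Round each weight to the nearest signed power of two."""
    sign = np.sign(w)
    mag = np.maximum(np.abs(w), 1e-12)           # avoid log2(0)
    exp = np.round(np.log2(mag))
    # keep only 2^(n_bits - 1) representable power levels below the max
    lo = exp.max() - (2 ** (n_bits - 1) - 1)
    exp = np.clip(exp, lo, exp.max())
    return sign * 2.0 ** exp, exp.astype(int)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_q, exps = pot_quantize(w, n_bits=4)
# In fixed-point hardware, x * 2^k is a left/right shift instead of a multiply.
```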
Authors: Jiawang Li
In this communication, two novel low-cost single-layer filtering antennas (filtennas) are proposed for millimeter-wave (mmWave) applications. The proposed filtennas consist of a compact circular substrate integrated waveguide (SIW) cavity, a metal post close to the center of the cavity for power feeding, a metal post in the center for mode control, and a slot for radiating power. In the passband, the fundamental TM010 mode and the TM110 mode in the circular SIW cavity are excited by the feeding post. In addition, thanks to the high-pass characteristics of the cavity, it exhibits more than 20 dB suppression in the lower frequency band. There are three radiation nulls in Filtenna 1 and one radiation null in Filtenna 2 in the upper band, which increase the suppression level to as high as 18 dB. As a proof of concept, the proposed filtennas are fabricated and measured. Filtenna 1 achieves simulated and measured -10 dB impedance fractional bandwidths (FBW) of 7.1% (27.14-29.13 GHz) and 8.6% (27.62-30.11 GHz), respectively, while Filtenna 2 achieves simulated and measured -10 dB FBW of 7.4% (27.86-29.99 GHz) and 10.1% (28.11-31.09 GHz), respectively. The filtennas feature stable radiation patterns with an average gain of 5.0 dBi, and the lower and upper sideband suppression levels for both exceed 18 dB. These filtennas are good candidates for 5G mmWave applications, as they simultaneously provide beam scanning and filtering capability with a low-cost, single-layer structure.
Authors: Ngoc Long Pham, Tri Nhu Do
Neural network (NN)-based end-to-end (E2E) communication systems, in which each system component may consist of a portion of a neural network, have been investigated as potential tools for developing artificial intelligence (AI)-native E2E systems. In this paper, we propose an NN-based bitwise receiver that improves computational efficiency while maintaining performance comparable to baseline demappers. Building on this foundation, we introduce a novel symbol-wise autoencoder (AE)-based E2E system that jointly optimizes the transmitter and receiver at the physical layer. We evaluate the proposed NN-based receiver using bit-error rate (BER) analysis to confirm that the numerical BER achieved by NN-based receivers or transceivers is accurate. Results demonstrate that the AE-based system outperforms baseline architectures, particularly for higher-order modulation schemes. We further show that the training signal-to-noise ratio (SNR) significantly affects the performance of the systems when inference is conducted at different SNR levels.
Authors: Satyavrat Wagle, Akshay Malhotra, Shahab Hamidi-Rad, Aditya Sant, David J.Love, Christopher G. Brinton
In recent years, machine learning (ML) methods have become increasingly popular in wireless communication systems for several applications. A critical bottleneck for designing ML systems for wireless communications is the availability of realistic wireless channel datasets, which are extremely resource-intensive to produce. To this end, the generation of realistic wireless channels plays a key role in the subsequent design of effective ML algorithms for wireless communication systems. Generative models have been proposed to synthesize channel matrices, but outputs produced by such methods may not correspond to geometrically viable channels and do not provide any insight into the scenario being generated. In this work, we aim to address both these issues by integrating established parametric, physics-based geometric channel (PPGC) modeling frameworks with generative methods to produce realistic channel matrices with interpretable representations in the parameter domain. We show that generative models converge to prohibitively suboptimal stationary points when learning the underlying prior directly over the parameters due to the non-convex PPGC model. To address this limitation, we propose a linearized reformulation of the problem to ensure smooth gradient flow during generative model training, while also providing insights into the underlying physical environment. We evaluate our model against prior baselines by comparing the generated, scenario-specific samples in terms of the 2-Wasserstein distance and through its utility when used for downstream compression tasks.
Authors: Yameng Liu, Jianhua Zhang, Yuxiang Zhang, Hongbo Xing, Yifeng Xiong, Zhiqiang Yuan, Guangyi Liu
Integrated Sensing And Communication (ISAC) has been identified as a key 6G application by ITU and 3GPP, with standardization efforts already underway. Sensing tasks, such as target localization, demand more precise characterization of the sensing target (ST) in ISAC channel modeling. The ST couples complexly with environmental scatterers, potentially blocking some multipaths and generating new ones, resulting in power variations compared to the original channel. To accurately model this effect, this paper proposes a coupled ISAC channel model based on measurements and validates it through similarity analysis between simulated and measured channels. In this work, we first conduct ISAC channel measurements in an indoor factory scenario at 105 GHz, where the multipath power variations caused by the ST's interaction with the environment are clearly observed. Then, we propose an ISAC channel modeling framework that incorporates two novel parameters: the Blockage-Region Coupling Factor (BR-CF) and the Forward-Scattering (FS)-CF, which characterize the spatial region and intensity of the coupling effect, respectively. Finally, the proposed model is validated through similarity comparison with measured data, demonstrating higher accuracy for both LoS and NLoS scenarios compared to the non-coupled model. This realistic ISAC channel model provides an effective framework for capturing the ST-environment coupling effect, supporting the design and evaluation of ISAC technologies.
Authors: Yinchao Yang, Zhaohui Yang, Chongwen Huang, Wei Xu, Zhaoyang Zhang, Dusit Niyato, Mohammad Shikh-Bahaei
This paper introduces a novel framework for integrated sensing, computing, and semantic communication (ISCSC) within vehicular networks comprising a roadside unit (RSU) and multiple autonomous vehicles. Both the RSU and the vehicles are equipped with local knowledge bases to facilitate semantic communication. The framework incorporates a secure communication design to ensure that messages intended for specific vehicles are protected against interception. In this model, an extended Kalman filter (EKF) is employed by the RSU to accurately track all vehicles. We formulate a joint optimization problem that balances maximizing the probabilistically constrained semantic secrecy rate for each vehicle while minimizing the sum of the posterior Cramér-Rao bound (PCRB), subject to the RSU's computing capabilities. This non-convex optimization problem is addressed using Bernstein-type inequality (BTI) and alternating optimization (AO) techniques. Simulation results validate the effectiveness of the proposed framework, demonstrating its advantages in reliable sensing, high data throughput, and secure communication.
Authors: Boyu Teng, Xiaojun Yuan, Rui Wang, Ying-Chang Liang
Extremely large antenna array (ELAA) not only effectively enhances system communication performance but also improves the sensing capabilities of communication systems, making it one of the key enabling technologies in 6G wireless networks. This paper investigates the multiuser localization problem in an uplink Multiple Input Multiple Output (MIMO) system, where the base station (BS) is equipped with an ELAA to receive signals from multiple single-antenna users. We exploit analog beamforming to reduce the number of radio frequency (RF) chains. We first develop a comprehensive near-field ELAA channel model that accounts for the antenna radiation pattern and free space path loss. Due to the large aperture of the ELAA, the angular resolution of the array is high, which improves user localization accuracy. However, it also makes the user localization problem highly non-convex, posing significant challenges when the number of RF chains is limited. To address this issue, we use an array partitioning strategy to divide the ELAA channel into multiple subarray channels and utilize the geometric constraints between user locations and subarrays for probabilistic modeling. To fully exploit these geometric constraints, we propose the array partitioning-based location estimation with limited measurements (APLE-LM) algorithm based on the message passing principle to achieve multiuser localization. We derive the Bayesian Cramer-Rao Bound (BCRB) as the theoretical performance lower bound for our formulated near-field multiuser localization problem. Extensive simulations under various parameter configurations validate the proposed APLE-LM algorithm. The results demonstrate that APLE-LM achieves superior localization accuracy compared to baseline algorithms and approaches the BCRB at high signal-to-noise ratio (SNR).
Authors: Jiadong He, Liang Yu, Zhiqiang Chen, Dawei Qiu, Dong Yue, Goran Strbac, Meng Zhang, Yujian Ye, Yi Wang
This letter proposes an Adversarial Inverse Reinforcement Learning (AIRL)-based energy management method for a smart home, which incorporates an implicit thermal dynamics model. In the proposed method, historical optimal decisions are first generated using a neural network-assisted Hierarchical Model Predictive Control (HMPC) framework. These decisions are then used as expert demonstrations in the AIRL module, which aims to train a discriminator to distinguish expert demonstrations from transitions generated by a reinforcement learning agent policy, while simultaneously updating the agent policy that can produce transitions to confuse the discriminator. The proposed HMPC-AIRL method eliminates the need for explicit thermal dynamics models, prior or predictive knowledge of uncertain parameters, or manually designed reward functions. Simulation results based on real-world traces demonstrate the effectiveness and data efficiency of the proposed method.
Authors: Xiaochun Ge, Wenqian Shen, Chengwen Xing, Lian Zhao, Jianping An
Training beam design for channel estimation with infinite-resolution and low-resolution phase shifters (PSs) in hybrid analog-digital millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems is considered in this paper. By exploiting the sparsity of mmWave channels, the optimization of the sensing matrices (corresponding to training beams) is formulated according to compressive sensing (CS) theory. Under the condition of infinite-resolution PSs, we propose algorithms to construct the sensing matrix, where convex optimization theory and gradient descent on a Riemannian manifold are used to design the digital and analog parts, respectively. Furthermore, a block-wise alternating hybrid analog-digital algorithm is proposed to tackle the design of training beams with low-resolution PSs, where the performance degradation caused by the non-convex constant-modulus and discrete phase constraints is effectively compensated to some extent thanks to the iterations among blocks. Finally, an orthogonal matching pursuit (OMP) based estimator is adopted to achieve effective recovery of the sparse mmWave channel. Simulation results demonstrate the performance advantages of the proposed algorithms compared with existing schemes.
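For reference, a minimal sketch of the OMP recovery step is given below, with a generic random sensing matrix standing in for the designed training beams; the dimensions and sparsity level are illustrative.

```python
import numpy as np

def omp(Phi, y, sparsity):
    """Greedy sparse recovery: y ~= Phi @ x with at most `sparsity` nonzeros."""
    residual, support = y.copy(), []
    x = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(sparsity):
        corr = np.abs(Phi.conj().T @ residual)   # correlate atoms with residual
        corr[support] = 0.0                      # exclude already-chosen atoms
        support.append(int(np.argmax(corr)))
        sol, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ sol
    x[support] = sol
    return x

rng = np.random.default_rng(0)
Phi = rng.normal(size=(32, 128)) + 1j * rng.normal(size=(32, 128))
x_true = np.zeros(128, dtype=complex)
x_true[[5, 40, 99]] = [1.0, -0.5j, 0.8]          # sparse angular-domain channel
x_hat = omp(Phi, Phi @ x_true, sparsity=3)
```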
Authors: Kiarash Hassas Irani, Sergiy A. Vorobyov, Yongwei Huang
Distributionally robust optimization (DRO)-based robust adaptive beamforming (RAB) enables enhanced robustness against model uncertainties, such as steering vector mismatches and interference-plus-noise covariance matrix estimation errors. Existing DRO-based RAB methods primarily rely on uncertainty sets characterized by the first- and second-order moments. In this work, we propose a novel Wasserstein DRO-based beamformer, using the worst-case signal-to-interference-plus-noise ratio maximization formulation. The proposed method leverages the Wasserstein metric to define uncertainty sets, offering a data-driven characterization of uncertainty. We show that the choice of the Wasserstein cost function plays a crucial role in shaping the resulting formulation, with norm-based and Mahalanobis-like quadratic costs recovering classical norm-constrained and ellipsoidal robust beamforming models, respectively. This insight highlights the Wasserstein DRO framework as a unifying approach, bridging deterministic and distributionally robust beamforming methodologies.
Authors: Buyi Yu, Wenyuan Tang
Planning and scheduling activities in the electrical power system, such as the commitment of reserve generation, often involve statistical characterization of peak demand. Extreme Value Analysis (EVA)-based probabilistic assessments of annual peaks are widely adopted by energy regulatory and oversight agencies to determine the likelihood and severity of potential energy shortfalls. Because classical EVA cannot account for peak distributions that change with annual extreme temperatures, popular existing approaches apply EVA to simulated annual peaks created by weather-dependent surrogate models using Monte Carlo simulations on a per-scenario basis. At higher time resolutions, such as day-ahead scheduling, the daily peak demand depends on various factors besides temperature, Monte Carlo experiments become intractable, and EVA-based modeling faces a methodological vacuum. This article proposes a nonstationary EVA estimator that predicts the probable peaks of high-resolution time intervals and their corresponding conditional probability densities based on calendar information and the weather conditions under which historical peaks are observed. We present a case study on the determination of day-ahead scheduling capacity and demonstrate that, compared to the industry approach, our approach results in a $38\%$ reduction in the yearly total committed capacity while maintaining the given risk requirement.
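As a sketch of what a nonstationary EVA fit can look like, the example below makes the GEV location parameter a linear function of a temperature covariate and estimates all parameters by maximum likelihood on synthetic data; the covariate structure is an illustrative assumption, not the article's estimator.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme

def neg_loglik(theta, peaks, covariate):
    mu0, mu1, log_sigma, c = theta
    mu = mu0 + mu1 * covariate                   # covariate-dependent location
    return -genextreme.logpdf(peaks, c, loc=mu,
                              scale=np.exp(log_sigma)).sum()

rng = np.random.default_rng(1)
temp = rng.uniform(20.0, 40.0, size=200)         # synthetic daily max temperature
peaks = genextreme.rvs(0.1, loc=50.0 + 0.8 * temp, scale=3.0,
                       size=200, random_state=rng)
fit = minimize(neg_loglik, x0=[50.0, 0.5, 1.0, 0.1],
               args=(peaks, temp), method="Nelder-Mead")
mu0, mu1, log_sigma, c = fit.x                   # fitted nonstationary GEV
```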
Authors: Kürşat Tekbıyık, Amir Hossein Fahim Raouf, İsmail Güvenç, Mingzhe Chen, Güneş Karabulut Kurt, Antoine Lesage-Landry
Low Altitude Economy (LAE) networks hold transformative potential for urban mobility, emergency response, and aerial logistics. However, these networks face significant challenges in spectrum management, interference mitigation, and real-time coordination across dynamic and resource-constrained environments. To address these challenges, this study explores three core elements for enabling intelligent LAE networks: machine learning-based spectrum sensing and coexistence; artificial intelligence (AI)-optimized resource allocation and trajectory planning; and testbed-driven validation and standardization. We highlight how federated and reinforcement learning techniques support decentralized, adaptive decision-making under mobility and energy constraints. In addition, we discuss the role of real-world platforms such as AERPAW in bridging the gap between simulation and deployment and enabling iterative system refinement under realistic conditions. This study aims to provide a forward-looking roadmap toward developing efficient and interoperable AI-driven LAE ecosystems.
Authors: Christopher Bohn, Manuel Hess, Sören Hohmann
This paper presents a method that addresses the conservatism, computational effort, and limited numerical accuracy of existing frameworks and methods that ensure safety in online model-based motion generation, commonly referred to as fast and safe tracking. Computational limitations restrict online motion planning to low-fidelity models. However, planning with low-fidelity models compromises safety, as the dynamic feasibility of resulting reference trajectories is not ensured. This potentially leads to unavoidable tracking errors that may cause safety-critical constraint violations. Existing frameworks mitigate this safety risk by augmenting safety-critical constraints in motion planning by a safety margin that prevents constraint violations under worst-case tracking errors. However, the methods employed in these frameworks determine the safety margin based on a heuristically selected performance of the planning model, which likely results in overly conservative reference trajectories. Furthermore, these methods are computationally intensive, and the state-of-the-art method is limited in numerical accuracy. We adopt a different perspective and address these limitations with a method that mitigates conservatism in existing frameworks by adapting the planning model performance to a given safety margin. Our method achieves numerical accuracy and requires significantly less computation time than existing methods by leveraging a captivity-escape game, which is a specific zero-sum differential game formulated in this paper. We demonstrate our method using a numerical example and compare it to the state of the art.
Authors: Ahmed Naeem, Anastassia Gharib, Hüseyin Arslan
Water-filling (WF) algorithms are pivotal in maximizing capacity and spectral efficiency in multiple-input and multiple-output (MIMO) systems. However, traditional WF approaches cater solely to communication requirements, neglecting the emerging heterogeneity of 6G, including sensing and joint radar-communication (JRC). As these diverse demands grow in importance and have different Quality of Service (QoS) constraints, traditional WF becomes inadequate. Therefore, in this paper, we propose a unified interference-aware and QoS-constrained WF algorithm for systems with communication, sensing, and JRC. The proposed algorithm enables power allocation for multi-user MIMO systems, effectively addressing interference and balancing the support for heterogeneous user requirements.
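For context, a minimal sketch of the classical water-filling baseline that such QoS- and interference-aware variants generalize is shown below; the channel gains and power budget are illustrative.

```python
import numpy as np

def waterfill(gains, p_total, tol=1e-9):
    """Allocate p_i = max(0, mu - 1/g_i) so that sum(p_i) = p_total."""
    lo, hi = 0.0, p_total + 1.0 / gains.min()    # bracket the water level mu
    while hi - lo > tol:                          # bisection on mu
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - 1.0 / gains).sum() < p_total:
            lo = mu
        else:
            hi = mu
    return np.maximum(0.0, 0.5 * (lo + hi) - 1.0 / gains)

gains = np.array([2.0, 1.0, 0.5, 0.1])           # effective eigen-channel gains
power = waterfill(gains, p_total=4.0)
rates = np.log2(1.0 + gains * power)             # resulting per-channel rates
```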
Authors: Juan Pablo Bertucci, Sudarshan Raghuraman, Mauro Salazar, Theo Hofman
The major challenges to battery electric truck adoption are their high cost and their impact on the grid. In this context, stationary energy storage systems can help mitigate both issues. Since their design and operation are strongly coupled, they should be jointly optimized to make the best of them. This paper presents a co-design framework for hybrid energy storage systems in which their technology and sizing are optimized jointly with their operational strategies. Specifically, we consider a microgrid supporting truck chargers that consists of the utility grid, solar panels, and energy storage systems including batteries, supercapacitors, and flywheels. We frame the co-design problem as a mixed-integer linear program that can be solved with global optimality guarantees. We showcase our framework in a case study of a distribution center in the Netherlands. Our results show that, although the battery-only configuration is already competitive, adding supercapacitors or flywheel storage decreases total cost and increases energy sold back to the grid. Overall, the fully hybrid solution (battery + supercapacitors + flywheel) offers the best outcomes, achieving the lowest overall cost (1.96\% lower compared to battery-only) and reduced grid dependency, but at a higher (2.6\%) initial investment.
Authors: Wang Dai, Archontis Politis, Tuomas Virtanen
We propose a novel approach that utilizes inter-speaker relative cues to distinguish target speakers and extract their voices from mixtures. Continuous cues (e.g., temporal order, age, pitch level) are grouped by relative differences, while discrete cues (e.g., language, gender, emotion) retain their categories. Relative cues offer greater flexibility than fixed speech attribute classification, facilitating much easier expansion of text-guided target speech extraction datasets. Our experiments show that combining all relative cues yields better performance than random subsets, with gender and temporal order being the most robust across languages and reverberant conditions. Additional cues such as pitch level, loudness, distance, speaking duration, language, and pitch range also demonstrate notable benefits in complex scenarios. Fine-tuning pre-trained WavLM Base+ CNN encoders improves overall performance over the baseline of using only a Conv1d encoder.
Authors: Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau
We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we probe the geometry of the feature space by constraining the set of allowed transformations. We find that just rotating the features is sufficient for high-quality voice conversion. This suggests that content information is embedded in a low-dimensional subspace which can be linearly transformed to produce a target voice. To validate this hypothesis, we finally propose a method that explicitly factorizes content and speaker information using singular value decomposition; the resulting linear projection with a rank of just 100 gives competitive conversion results. Our work has implications for both practical voice conversion and a broader understanding of self-supervised speech representations. Samples and code: this https URL.
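A minimal sketch of the rotation-only variant is shown below: the orthogonal Procrustes problem is solved via SVD to map paired source features to target features; the random matrices stand in for actual self-supervised (e.g., WavLM-style) features.

```python
import numpy as np

def fit_rotation(X_src, X_tgt):
    """Orthogonal Procrustes: W = argmin ||W X_src - X_tgt||_F over W W^T = I."""
    U, _, Vt = np.linalg.svd(X_tgt @ X_src.T)
    return U @ Vt                                # orthogonal (rotation/reflection)

rng = np.random.default_rng(0)
X_src = rng.normal(size=(768, 1000))             # (dim, frames) source features
X_tgt = rng.normal(size=(768, 1000))             # paired target-speaker features
W = fit_rotation(X_src, X_tgt)
converted = W @ X_src                            # rotated features -> vocoder
```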
Authors: Finn G. Maurer, Erlend A. Basso, Henrik M. Schmidt-Didlaukies, Torleiv H. Bryne
This paper derives the extended Kalman filter (EKF) for continuous-time systems on matrix Lie groups observed through discrete-time measurements. By modeling the system noise on the Lie algebra and adopting a Stratonovich interpretation for the stochastic differential equation (SDE), we ensure that solutions remain on the manifold. The derivation of the filter follows classical EKF principles, naturally integrating a necessary full-order covariance reset post-measurement update. A key contribution is proving that this full-order covariance reset guarantees that the Lie-group-valued state estimate is invariant to whether a left- or right-invariant error definition is used in the EKF. Monte Carlo simulations of the aided inertial navigation problem validate the invariance property and confirm its absence when employing reduced-order covariance resets.
Authors: Wangyou Zhang, Kohei Saijo, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Wei Wang, Yihui Fu, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian
The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics. Nourished by the challenge outcomes, this paper presents an in-depth analysis of two key, yet understudied, issues in SE system development: data cleaning and evaluation metrics. We highlight several overlooked problems in traditional SE pipelines: (1) mismatches between declared and effective audio bandwidths, along with label noise even in various "high-quality" speech corpora; (2) lack of both effective SE systems to conquer the hardest conditions (e.g., speech overlap, strong noise / reverberation) and reliable measure of speech sample difficulty; (3) importance of combining multifaceted metrics for a comprehensive evaluation correlating well with human judgment. We hope that this endeavor can inspire improved SE pipeline designs in the future.
Authors: Karl El Hajal, Enno Hermann, Sevada Hovsepyan, Mathew Magimai.-Doss
Automatic speech recognition (ASR) systems struggle with dysarthric speech due to high inter-speaker variability and slow speaking rates. To address this, we explore dysarthric-to-healthy speech conversion for improved ASR performance. Our approach extends the Rhythm and Voice (RnV) conversion framework by introducing a syllable-based rhythm modeling method suited for dysarthric speech. We assess its impact on ASR by training LF-MMI models and fine-tuning Whisper on converted speech. Experiments on the Torgo corpus reveal that LF-MMI achieves significant word error rate reductions, especially for more severe cases of dysarthria, while fine-tuning Whisper on converted data has minimal effect on its performance. These results highlight the potential of unsupervised rhythm and voice conversion for dysarthric ASR. Code available at: this https URL
Authors: Shi He, Lingsheng Meng, Yao Ge, Yong Liang Guan, David González G., Zilong Liu
This paper focuses on designing Doppler-resilient sequences with low local Ambiguity Function (AF) sidelobes, subject to certain spectral and Peak-to-Average Power Ratio (PAPR) constraints. To achieve this, we propose two distinct optimization algorithms: (i) an Alternating Minimization (AM) algorithm for superior Weighted Peak Sidelobe Level (WPSL) minimization, and (ii) a low-complexity Augmented Lagrangian-assisted Majorization Minimization (ALaMM) algorithm with effective WPSL suppression. The proposed schemes hold great potential for sequence design in future 6G and integrated sensing and communication applications, supporting robust sensing under spectral coexistence constraints in high-mobility scenarios.
Authors: Toon Van Puyvelde, Mehran Zareh, Chris Develder
In recent years, deep reinforcement learning (DRL) algorithms have gained traction in home energy management systems. However, their adoption by energy management companies remains limited due to the black-box nature of DRL, which fails to provide transparent decision-making feedback. To address this, explainable reinforcement learning (XRL) techniques have emerged, aiming to make DRL decisions more transparent. Among these, soft differentiable decision tree (DDT) distillation provides a promising approach due to the clear decision rules it is based on, which can be computed efficiently. However, achieving high performance often requires deep, completely full trees, which reduces interpretability. To overcome this, we propose a novel asymmetric soft DDT construction method. Unlike traditional soft DDTs, our approach adaptively constructs trees by expanding nodes only when necessary. This makes more efficient use of decision nodes than full symmetric trees, which require a predetermined depth, enhancing both interpretability and performance. We demonstrate the potential of asymmetric DDTs to provide transparent, efficient, and high-performing decision-making in home energy management systems.
Authors: Mattson Ogg, Caitlyn Bishop, Han Yi, Sarah Robinson
Methods for automatically assessing speech quality are critical for many human language technologies. Behavioral ratings provided by human raters (e.g., mean opinion scores; MOS) are considered the gold standard, but they are susceptible to variability between individual raters, cannot easily be generalized across corpora, and are labor-intensive to collect, thus limiting the acoustic challenges they can quantify. Here, we present a new, scalable method for automatically assessing speech quality: the self-supervised speech quality assessment (S3QA) model. First, we processed high-quality utterances from multiple speech corpora, using a wide range of acoustic manipulations intended to emulate common sources of quality degradation in the real world: frequency filtering, reverberation, background noise, and digital compression. Second, we leveraged an existing, pre-trained speech foundation model, WavLM, to computationally derive a self-supervised training target for the level of signal degradation by calculating the cosine distances between the clean and degraded versions of each utterance in the embedding space. Next, we trained a transformer-based model to predict the cosine distance, or degradation index, given only the degraded versions of these utterances. Finally, the trained model was evaluated on unseen test corpora of synthetic mixtures, NISQA, and VOiCES. We show that the S3QA model trained on this task performs well and is aligned with behavioral ratings (MOS), speech technology performance (automatic speech recognition), and other important features of the held-out data (e.g., microphone distances). This approach provides an automated, scalable method for assessing speech quality across a wide range of acoustic challenges, and could easily be adapted to other use cases where acoustic simulations are available.
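A minimal sketch of the self-supervised training target is shown below: the degradation index is the cosine distance between clean and degraded utterance embeddings; the random vectors stand in for pooled WavLM features.

```python
import numpy as np

def degradation_index(emb_clean, emb_degraded):
    """Cosine distance: 0 for identical embeddings, larger means more degraded."""
    cos = emb_clean @ emb_degraded / (
        np.linalg.norm(emb_clean) * np.linalg.norm(emb_degraded))
    return 1.0 - cos

rng = np.random.default_rng(0)
clean = rng.normal(size=768)                     # stand-in for pooled WavLM features
degraded = clean + 0.3 * rng.normal(size=768)    # simulated quality degradation
target = degradation_index(clean, degraded)      # regression target for the model
```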
Authors: A. Hippert-Ferrer, A. Sportisse, A. Javaheri, M. N. El Korso, D. P. Palomar
The goal of this tutorial is to provide an overview of recent methods for handling missing data in signal processing methods, from their origins to the challenges ahead. Missing data approaches are grouped by three main categories: i) missing-data imputation, ii) estimation with missing values and iii) prediction with missing values. We focus on methodological and experimental results through specific case studies on real-world applications. Promising and future research directions, including a better integration of informative missingness, are also discussed. We believe that the proposed conceptual framework and the presentation of the main problems related to missing data will encourage researchers of the signal processing community to develop original methods for handling missing values and to efficiently deal with new applications involving missing data.
Authors: Aikaterini Maria Panteleaki, Varatheepan Paramanayakam, Vasileios Pentsos, Andreas Karatzas, Spyros Tragoudas, Iraklis Anagnostopoulos
The increasing demand for Artificial Intelligence (AI) computing poses significant environmental challenges, with both operational and embodied carbon emissions becoming major contributors. This paper presents a carbon-aware holistic methodology for designing and managing sustainable Edge Data Centers (EDCs), based on three design principles that challenge the state-of-the-art optimization paradigms. Our approach employs vertical integration across the architecture, system, and runtime layers, balances operational and embodied carbon emissions while considering EDC performance as a co-optimization objective, rather than a constraint. At the architecture level, we propose carbon-aware and approximate accelerator designs to reduce embodied carbon. At the system level, we enhance resource utilization and adapt to real-time carbon intensity variations to minimize operational emissions. Finally, at the runtime level, we develop dynamic scheduling frameworks that adjust execution, based on energy constraints and carbon intensity.
Authors: J. Wehbeh, E. C. Kerrigan
Robust optimal or min-max model predictive control (MPC) approaches aim to guarantee constraint satisfaction over a known, bounded uncertainty set while minimizing a worst-case performance bound. Traditionally, these methods compute a trajectory that meets the desired properties over a fixed prediction horizon, apply a portion of the resulting input, and then re-solve the MPC problem using newly obtained measurements at the next time step. However, this approach fails to account for the fact that the control trajectory will be updated in the future, potentially leading to conservative designs. In this paper, we present a novel update-aware robust optimal MPC algorithm for decreasing horizon problems on nonlinear systems that explicitly accounts for future control trajectory updates. This additional insight allows our method to provably expand the feasible solution set and guarantee improved worst-case performance bounds compared to existing techniques. Our approach formulates the trajectory generation problem as a sequence of nested existence-constrained semi-infinite programs (SIPs), which can be efficiently solved using local reduction techniques. To demonstrate its effectiveness, we evaluate our approach on a planar quadrotor problem, where it clearly outperforms an equivalent method that does not account for future updates at the cost of increased computation time.
Authors: Anna Leschanowsky, Kishor Kayyar Lakshminarayana, Anjana Rajasekhar, Lyonel Behringer, Ibrahim Kilinc, Guillaume Fuchs, Emanuël A. P. Habets
Speech intelligibility assessment is essential for evaluating neural speech codecs, yet most evaluation efforts focus on overall quality rather than intelligibility. Only a few publicly available tools exist for conducting standardized intelligibility tests, like the Diagnostic Rhyme Test (DRT) and Modified Rhyme Test (MRT). We introduce the Speech Intelligibility Toolkit for Subjective Evaluation (SITool), a Flask-based web application for conducting DRT and MRT in laboratory and crowdsourcing settings. We use SITool to benchmark 13 neural and traditional speech codecs, analyzing phoneme-level degradations and comparing subjective DRT results with objective intelligibility metrics. Our findings show that, while neural speech codecs can outperform traditional ones in subjective intelligibility, only STOI and ESTOI - not WER - significantly correlate with subjective results, although they struggle to capture gender and wordlist-specific variations observed in subjective evaluations.
Authors: J. Wehbeh, E. C. Kerrigan
In some optimal control problems, complex relationships between states and inputs cannot be easily represented using continuous constraints, necessitating the use of discrete logic instead. This paper presents a method for incorporating such logic constraints directly within continuous optimization frameworks, eliminating the need for binary variables or specialized solvers. Our approach reformulates arbitrary logic constraints under minimal assumptions as max-min constraints, which are then smoothed by introducing auxiliary variables into the optimization problem. When these reformulated constraints are satisfied, they guarantee that the original logical conditions hold, ensuring correctness in the optimization process. We demonstrate the effectiveness of this method on two planar quadrotor control tasks with complex logic constraints. Compared to existing techniques for encoding logic in continuous optimization, our approach achieves faster computational performance and improved convergence to feasible solutions.
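To illustrate the max-min encoding of logic, the sketch below expresses an OR/AND combination of inequality constraints as a min/max expression and replaces both with smooth log-sum-exp surrogates; the paper's auxiliary-variable smoothing differs in detail, so this is only a schematic of the encoding step.

```python
import numpy as np

def smooth_max(vals, rho=20.0):
    """Log-sum-exp upper bound on max; tightens as rho grows."""
    vals = np.asarray(vals, dtype=float)
    m = vals.max()                               # shift for numerical stability
    return m + np.log(np.exp(rho * (vals - m)).sum()) / rho

def smooth_min(vals, rho=20.0):
    return -smooth_max(-np.asarray(vals, dtype=float), rho)

# "g1 <= 0 OR (g2 <= 0 AND g3 <= 0)" holds iff min(g1, max(g2, g3)) <= 0.
g1, g2, g3 = 0.5, -0.2, -0.4
residual = smooth_min([g1, smooth_max([g2, g3])])
feasible = residual <= 0.0
# Note: smooth_max over-approximates max (conservative for AND), while
# smooth_min under-approximates min, so the exact logical condition should
# still be verified once the smoothed problem converges.
```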
Authors: Rania Tafat, Jaime A. Moreno, Stefan Streif
The Generalized Super-Twisting Observer (GSTO) is extended for a strongly observable class of nonlinearly interconnected systems with bounded uncertainties/perturbations. A nonsmooth strong Lyapunov function is used to prove the finite-time convergence of the proposed observer to the true system's trajectories, in the presence of the uncertainties. A case study on the interaction between two food production systems is presented, comparing the proposed observer with the High Gain observer. The results emphasize the critical role of the GSTO's discontinuous term in achieving exact estimation.
Authors: Defne E. Ozan, Andrea Nóvoa, Georgios Rigas, Luca Magri
The control of spatio-temporal chaos is challenging because of high dimensionality and unpredictability. Model-free reinforcement learning (RL) discovers optimal control policies by interacting with the system, typically requiring observations of the full physical state. In practice, sensors often provide only partial and noisy measurements (observations) of the system. The objective of this paper is to develop a framework that enables the control of chaotic systems with partial and noisy observability. The proposed method, data-assimilated model-informed reinforcement learning (DA-MIRL), integrates (i) low-order models to approximate high-dimensional dynamics; (ii) sequential data assimilation to correct the model prediction when observations become available; and (iii) an off-policy actor-critic RL algorithm to adaptively learn an optimal control strategy based on the corrected state estimates. We test DA-MIRL on the spatiotemporally chaotic solutions of the Kuramoto-Sivashinsky equation. We estimate the full state of the environment with (i) a physics-based model, here a coarse-grained model; and (ii) a data-driven model, here the control-aware echo state network proposed in this paper. We show that DA-MIRL successfully estimates and suppresses the chaotic dynamics of the environment in real time from partial observations and approximate models. This work opens opportunities for the control of partially observable chaotic systems.
Authors: Felipe Villenas, Kaiquan Wu, Yunus Can Gültekin, Jamal Riani, Alex Alvarado
We propose a novel 5-bit/2D-symbol modulation format based on PAM-6 optimized for IM-DD systems dominated by relative intensity noise. The proposed modulation scheme improves SNR by 0.94 dB compared to conventional PAM-6 and achieves near-optimal BER performance.
Authors: Jiaxi Sheng, Leyi Yu, Haoyue Li, Yifan Gao, Xin Gao
Evaluating AI-generated medical image segmentations for clinical acceptability poses a significant challenge, as traditional pixel-agreement metrics often fail to capture true diagnostic utility. This paper introduces the Hierarchical Clinical Reasoner (HCR), a novel framework that leverages Large Language Models (LLMs) as clinical guardrails for reliable, zero-shot quality assessment. HCR employs a structured, multi-stage prompting strategy that guides LLMs through a detailed reasoning process, encompassing knowledge recall, visual feature analysis, anatomical inference, and clinical synthesis, to evaluate segmentations. We evaluated HCR on a diverse dataset across six medical imaging tasks. Our results show that HCR, utilizing models like Gemini 2.5 Flash, achieved a classification accuracy of 78.12%, performing comparably to, and in some instances exceeding, dedicated vision models such as ResNet50 (72.92% accuracy) that were specifically trained for this task. The HCR framework not only provides accurate quality classifications but also generates interpretable, step-by-step reasoning for its assessments. This work demonstrates the potential of LLMs, when appropriately guided, to serve as sophisticated evaluators, offering a pathway towards more trustworthy and clinically aligned quality control for AI in medical imaging.
Authors: Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe
Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms). DSUs offer significant advantages for on-device streaming speech applications due to their rich phonetic information, high transmission efficiency, and seamless integration with large language models. However, conventional DSU-based approaches are impractical as they require full-length speech input and computationally expensive S3Ms. In this work, we reduce both the attention window and the model size while preserving the effectiveness of DSUs. Our results demonstrate that we can reduce floating-point operations (FLOPs) by 50% with only a relative increase of 6.5% in character error rate (CER) on the ML-SUPERB 1h dataset. These findings highlight the potential of DSUs for real-time speech processing in resource-constrained environments.
Authors: Ian Walter, Jitesh H. Panchal, Philip E. Paré
We propose a hybrid spreading process model to capture the dynamics of demand for software-based products. We introduce discontinuous jumps in the state to model sudden surges in demand that can be seen immediately after a product update is released. After each update, the modeled demand evolves according to a continuous-time susceptible-infected-susceptible (SIS) epidemic model. We identify the necessary and sufficient conditions for estimating the hybrid model's parameters for an arbitrary finite number of sequential updates. We verify the parameter estimation conditions in simulation, and evaluate how the estimation of these parameters is impacted by the presence of observation and process noise. We then validate our model by applying our estimation method to daily user engagement data for a regularly updating software product, the live-service video game `Apex Legends.'
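As an illustration of the hybrid dynamics described above, a minimal simulation could combine the continuous SIS flow with state jumps at release times; all parameter names and values below are illustrative, not the paper's:

```python
import numpy as np

def simulate_hybrid_sis(beta, gamma, jump_sizes, update_times, x0, t_end, dt=0.01):
    """Fraction-of-adopters SIS flow dx/dt = beta*x*(1-x) - gamma*x, with a
    discontinuous jump x -> min(x + jump, 1) at each software-update release."""
    ts = np.arange(0.0, t_end, dt)
    xs = np.empty_like(ts)
    x, k = x0, 0
    for i, t in enumerate(ts):
        if k < len(update_times) and t >= update_times[k]:
            x = min(x + jump_sizes[k], 1.0)   # demand surge at release
            k += 1
        x += dt * (beta * x * (1.0 - x) - gamma * x)  # continuous SIS dynamics
        xs[i] = x
    return ts, xs

# e.g. simulate_hybrid_sis(0.8, 0.5, [0.2, 0.15], [10.0, 30.0], 0.05, 60.0)
```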
Authors: Xianrui Zheng, Chao Zhang, Philip C. Woodland
This paper introduces DNCASR, a novel end-to-end trainable system designed for joint neural speaker clustering and automatic speech recognition (ASR), enabling speaker-attributed transcription of long multi-party meetings. DNCASR uses two separate encoders to independently encode global speaker characteristics and local waveform information, along with two linked decoders to generate speaker-attributed transcriptions. The use of linked decoders allows the entire system to be jointly trained under a unified loss function. By employing a serialised training approach, DNCASR effectively addresses overlapping speech in real-world meetings, where the link improves the prediction of speaker indices in overlapping segments. Experiments on the AMI-MDM meeting corpus demonstrate that the jointly trained DNCASR outperforms a parallel system that does not have links between the speaker and ASR decoders. Using cpWER to measure the speaker-attributed word error rate, DNCASR achieves a 9.0% relative reduction on the AMI-MDM Eval set.
Authors: Mushfiqur Rahman, Ismail Guvenc, Jason A. Abrahamson, Amitabh Mishra, Arupjyoti Bhuyan
An Unmanned Aerial Vehicle (UAV)-based communication typically involves a link between a UAV-mounted antenna and a ground station. The radiation pattern of both antennas is influenced by nearby reflecting surfaces and scatterers, such as the UAV body and the ground. Experimentally characterizing the effective radiation patterns of both antennas is challenging, as the received power depends on their interaction. In this study, we learn a combined radiation pattern from experimental UAV flight data, assuming the UAV travels with a fixed orientation (constant yaw angle and zero pitch/roll). We validate the characterized radiation pattern by cross-referencing it with experiments involving different UAV trajectories, all conducted under identical ground station and UAV orientation conditions. Experimental results show that the learned combined radiation pattern reduces received power estimation error by up to 10 dB, compared to traditional anechoic chamber radiation patterns that neglect the effects of the UAV body and surrounding objects.
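One simple way to realize such a data-driven pattern fit, assuming free-space path loss at an illustrative 2.4 GHz carrier and a Fourier-series gain model in azimuth (neither is stated in the abstract), is ordinary least squares on the residual received power:

```python
import numpy as np

def fit_pattern(angles_rad, rx_power_dbm, tx_power_dbm, dists_m, n_harmonics=8):
    """Least-squares fit of a combined TX/RX gain pattern G(theta) in dB from
    flight measurements, after removing free-space path loss (FSPL)."""
    fspl_db = 20 * np.log10(dists_m) + 20 * np.log10(2.4e9) - 147.55
    gain_db = rx_power_dbm - tx_power_dbm + fspl_db   # leftover = combined pattern
    # Fourier-series design matrix in the azimuth angle
    cols = [np.ones_like(angles_rad)]
    for k in range(1, n_harmonics + 1):
        cols += [np.cos(k * angles_rad), np.sin(k * angles_rad)]
    X = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(X, gain_db, rcond=None)
    return lambda th: np.stack(
        [np.ones_like(th)] + [f(k * th) for k in range(1, n_harmonics + 1)
                              for f in (np.cos, np.sin)], axis=1) @ coef
```

The fitted function can then be evaluated on the angles of a held-out trajectory to cross-reference against measured power, as the abstract describes.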
Authors: Arjun Prasaath Anbazhagan, Parteek Kumar, Ujjwal Kaur, Aslihan Akalin, Kevin Zhu, Sean O'Brien
How does textual representation of audio relate to the Large Language Model's (LLMs) learning about the audio world? This research investigates the extent to which LLMs can be prompted to generate audio, despite their primary training in textual data. We employ a three-tier approach, progressively increasing the complexity of audio generation: 1) Musical Notes, 2) Environmental Sounds, and 3) Human Speech. To bridge the gap between text and audio, we leverage code as an intermediary, prompting LLMs to generate code that, when executed, produces the desired audio output. To evaluate the quality and accuracy of the generated audio, we employ FAD and CLAP scores. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases. This suggests that while LLMs possess a latent understanding of the auditory world, their ability to translate this understanding into tangible audio output remains rudimentary. Further research into techniques that can enhance the quality and diversity of LLM-generated audio can lead to an improvement in the performance of text-based LLMs in generating audio.
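For the first tier, the code emitted by an LLM can be as simple as synthesizing a pure tone; a sketch of such a program (file name and parameters illustrative):

```python
import numpy as np
import wave

def write_note(path, freq_hz=440.0, dur_s=1.0, sr=16000):
    """Render a pure tone to a WAV file -- the kind of program an LLM
    might emit for the 'Musical Notes' tier."""
    t = np.arange(int(sr * dur_s)) / sr
    x = (0.5 * np.sin(2 * np.pi * freq_hz * t) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sr)
        f.writeframes(x.tobytes())

write_note("a4.wav")  # concert A
```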
Authors: Behtom Adeli, John Mclinden, Pankaj Pandey, Ming Shao, Yalda Shahriari
In recent years, deep learning (DL) approaches have demonstrated promising results in decoding hemodynamic responses captured by functional near-infrared spectroscopy (fNIRS), particularly in the context of brain-computer interface (BCI) applications. This work introduces AbsoluteNet, a novel deep learning architecture designed to classify auditory event-related responses recorded using fNIRS. The proposed network is built upon principles of spatio-temporal convolution and customized activation functions. Our model was compared against several models, namely fNIRSNET, MDNN, DeepConvNet, and ShallowConvNet. The results showed that AbsoluteNet outperforms existing models, reaching 87.0% accuracy, 84.8% sensitivity, and 89.2% specificity in binary classification, surpassing fNIRSNET, the second-best model, by 3.8% in accuracy. These findings underscore the effectiveness of our proposed deep learning model in decoding hemodynamic responses related to auditory processing and highlight the importance of spatio-temporal feature aggregation and customized activation functions to better fit fNIRS dynamics.
Authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo
We introduce ACE-Step, a novel open-source foundation model for music generation that overcomes key limitations of existing approaches and achieves state-of-the-art performance through a holistic architectural design. Current methods face inherent trade-offs between generation speed, musical coherence, and controllability. For example, LLM-based models (e.g. Yue, SongGen) excel at lyric alignment but suffer from slow inference and structural artifacts. Diffusion models (e.g. DiffRhythm), on the other hand, enable faster synthesis but often lack long-range structural coherence. ACE-Step bridges this gap by integrating diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer. It also leverages MERT and m-hubert to align semantic representations (REPA) during training, allowing rapid convergence. As a result, our model synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU, 15x faster than LLM-based baselines, while achieving superior musical coherence and lyric alignment across melody, harmony, and rhythm metrics. Moreover, ACE-Step preserves fine-grained acoustic details, enabling advanced control mechanisms such as voice cloning, lyric editing, remixing, and track generation (e.g. lyric2vocal, singing2accompaniment). Rather than building yet another end-to-end text-to-music pipeline, our vision is to establish a foundation model for music AI: a fast, general-purpose, efficient yet flexible architecture that makes it easy to train subtasks on top of it. This paves the way for the development of powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. In short, our goal is to build a stable diffusion moment for music. The code, the model weights and the demo are available at: this https URL.
Authors: Sujeet Kumar, Pretam Ray, Abhinay Beerukuri, Shrey Kamoji, Manoj Balaji Jagadeeshan, Pawan Goyal
Sanskrit, an ancient language with a rich linguistic heritage, presents unique challenges for automatic speech recognition (ASR) due to its phonemic complexity and the phonetic transformations that occur at word junctures, similar to the connected speech found in natural conversations. Due to these complexities, there has been limited exploration of ASR in Sanskrit, particularly in the context of its poetic verses, which are characterized by intricate prosodic and rhythmic patterns. This gap in research raises the question: How can we develop an effective ASR system for Sanskrit, particularly one that captures the nuanced features of its poetic form? In this study, we introduce Vedavani, the first comprehensive ASR study focused on Sanskrit Vedic poetry. We present a 54-hour Sanskrit ASR dataset, consisting of 30,779 labelled audio samples from the Rig Veda and Atharva Veda. This dataset captures the precise prosodic and rhythmic features that define the language. We also benchmark the dataset on various state-of-the-art multilingual speech models.$^{1}$ Experimentation revealed that IndicWhisper performed the best among the SOTA models.
Authors: Linh Pham
Few code-switching datasets, labeled or unlabeled, exist today. As a result, ASR requires new methods to utilize the vast monolingual data and models that exist. This paper uses OpenAI's open-source ASR model, Whisper, which has been pre-trained on 680K hours of audio, to perform monolingual ASR tasks. In Part 1, this paper examines how exploiting Whisper's monolingual ability to individually tokenize training text, called the "Switching Tokenizers Method", improves transcription accuracy. In Part 2, we combine the Switching Tokenizers Method from Part 1 and train a GELU-based adapter on the encoder. These two methods reduced Total Mixed Error Rate (MER) to 9.4% for the ASCEND dataset, 6% for SEAME devman and 9.7% for SEAME devsge, outperforming current SoTA methods.
Authors: Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe
The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models using academic-scale resources, but their training data remains insufficient. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. However, incorporating YODAS is nontrivial due to its wild nature, which introduces challenges such as incorrect language labels and audio-text misalignments. To address this, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages. Our new series of OWSM v4 models, trained on this curated dataset alongside existing OWSM data, significantly outperform previous versions on multilingual benchmarks. Our models even match or surpass frontier industrial models like Whisper and MMS in multiple scenarios. We will publicly release the cleaned YODAS data, pre-trained models, and all associated scripts via the ESPnet toolkit.
Authors: Mustafa Chasmai, Alexander Shepard, Subhransu Maji, Grant Van Horn
We present the iNaturalist Sounds Dataset (iNatSounds), a collection of 230,000 audio files capturing sounds from over 5,500 species, contributed by more than 27,000 recordists worldwide. The dataset encompasses sounds from birds, mammals, insects, reptiles, and amphibians, with audio and species labels derived from observations submitted to iNaturalist, a global citizen science platform. Each recording in the dataset varies in length and includes a single species annotation. We benchmark multiple backbone architectures, comparing multiclass classification objectives with multilabel objectives. Despite weak labeling, we demonstrate that iNatSounds serves as a useful pretraining resource by benchmarking it on strongly labeled downstream evaluation datasets. The dataset is available as a single, freely accessible archive, promoting accessibility and research in this important domain. We envision models trained on this data powering next-generation public engagement applications, and assisting biologists, ecologists, and land use managers in processing large audio collections, thereby contributing to the understanding of species compositions in diverse soundscapes.
Authors: Xueyuan Chen, Dongchao Yang, Wenxuan Wu, Minglin Wu, Jing Xu, Xixin Wu, Zhiyong Wu, Helen Meng
Dysarthric speech reconstruction (DSR) aims to convert dysarthric speech into comprehensible speech while maintaining the speaker's identity. Despite significant advancements, existing methods often struggle with low speech intelligibility and poor speaker similarity. In this study, we introduce a novel diffusion-based DSR system that leverages a latent diffusion model to enhance the quality of speech reconstruction. Our model comprises: (i) a speech content encoder for phoneme embedding restoration via pre-trained self-supervised learning (SSL) speech foundation models; (ii) a speaker identity encoder for speaker-aware identity preservation by in-context learning mechanism; (iii) a diffusion-based speech generator to reconstruct the speech based on the restored phoneme embedding and preserved speaker identity. Through evaluations on the widely-used UASpeech corpus, our proposed model shows notable enhancements in speech intelligibility and speaker similarity.
Authors: Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian, Yunhui Guo
While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available $\href{this https URL}{here}$.
Authors: Ngoc Tuyen Do, Tri Nhu Do
In the surveillance and defense domain, multi-target detection and classification (MTD) is considered essential yet challenging due to heterogeneous inputs from diverse data sources and the computational complexity of algorithms designed for resource-constrained embedded devices, particularly for AI-based solutions. To address these challenges, we propose a feature fusion and knowledge-distilled framework for multi-modal MTD that leverages data fusion to enhance accuracy and employs knowledge distillation for improved domain adaptation. Specifically, our approach utilizes both RGB and thermal image inputs within a novel fusion-based multi-modal model, coupled with a distillation training pipeline. We formulate the problem as a posterior probability optimization task, which is solved through a multi-stage training pipeline supported by a composite loss function. This loss function effectively transfers knowledge from a teacher model to a student model. Experimental results demonstrate that our student model achieves approximately 95% of the teacher model's mean Average Precision while reducing inference time by approximately 50%, underscoring its suitability for practical MTD deployment scenarios.
Authors: Ruibo Fu, Xiaopeng Wang, Zhengqi Wen, Jianhua Tao, Yuankun Xie, Zhiyong Wang, Chunyu Qiang, Xuefei Liu, Cunhang Fan, Chenxing Li, Guanjun Li
Existing methods for deepfake audio detection have demonstrated some effectiveness. However, they still face challenges in generalizing to new forgery techniques and evolving attack patterns. This limitation mainly arises because the models rely heavily on the distribution of the training data and fail to learn a decision boundary that captures the essential characteristics of forgeries. Additionally, relying solely on a classification loss makes it difficult to capture the intrinsic differences between real and fake audio. In this paper, we propose RPRA-ADD, an integrated Reconstruction-Perception-Reinforcement-Attention network framework for robust audio deepfake detection driven by forgery trace enhancement. First, we propose a Global-Local Forgery Perception (GLFP) module for enhancing the acoustic perception capacity of forgery traces. To significantly reinforce the feature space distribution differences between real and fake audio, the Multi-stage Dispersed Enhancement Loss (MDEL) is designed, which implements a dispersal strategy in multi-stage feature spaces. Furthermore, in order to enhance feature awareness towards forgery traces, the Fake Trace Focused Attention (FTFA) mechanism is introduced to adjust attention weights dynamically according to the reconstruction discrepancy matrix. Visualization experiments demonstrate not only that FTFA improves attention to voice segments but also that it enhances generalization capability. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on 4 benchmark datasets, including ASVspoof2019, ASVspoof2021, CodecFake, and FakeSound, achieving over 20% performance improvement. In addition, it outperforms existing methods in rigorous $3\times3$ cross-domain evaluations across Speech, Sound, and Singing, demonstrating strong generalization capability across diverse audio domains.
Authors: Siavash Shams, Richard Antonello, Gavin Mischler, Stephan Bickel, Ashesh Mehta, Nima Mesgarani
Decoding continuous language from neural signals remains a significant challenge in the intersection of neuroscience and artificial intelligence. We introduce Neuro2Semantic, a novel framework that reconstructs the semantic content of perceived speech from intracranial EEG (iEEG) recordings. Our approach consists of two phases: first, an LSTM-based adapter aligns neural signals with pre-trained text embeddings; second, a corrector module generates continuous, natural text directly from these aligned embeddings. This flexible method overcomes the limitations of previous decoding approaches and enables unconstrained text generation. Neuro2Semantic achieves strong performance with as little as 30 minutes of neural data, outperforming a recent state-of-the-art method in low-data settings. These results highlight the potential for practical applications in brain-computer interfaces and neural decoding technologies.
Authors: Yakun Song, Jiawei Chen, Xiaobin Zhuang, Chenpeng Du, Ziyang Ma, Jian Wu, Jian Cong, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce $\textbf{MagiCodec}$, a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at this https URL.
Authors: Vishwanath Pratap Singh, Md. Sahidullah, Tomi Kinnunen
Children's automatic speech recognition (ASR) often underperforms compared to that of adults due to a confluence of interdependent factors: physiological (e.g., smaller vocal tracts), cognitive (e.g., underdeveloped pronunciation), and extrinsic (e.g., vocabulary limitations, background noise). Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies, such as age affecting ASR accuracy both directly and indirectly via pronunciation skills. In this paper, we introduce a causal structure discovery method to unravel these interdependent relationships among physiology, cognition, extrinsic factors, and ASR errors. Then, we employ causal quantification to measure each factor's impact on children's ASR. We extend the analysis to fine-tuned models to identify which factors are mitigated by fine-tuning and which remain largely unaffected. Experiments on Whisper and Wav2Vec2.0 demonstrate the generalizability of our findings across different ASR systems.
Authors: Luigi Sigillo, Shengfeng He, Danilo Comminiello
High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight framework that enables any latent diffusion model to scale to ultra-high-resolution image generation (2K to 4K) for free. LWD introduces three key components: (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations; (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space; and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and incurs no additional computational overhead. Despite its simplicity, it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models. These results highlight the effectiveness of frequency-aware, signal-driven supervision as a principled and efficient approach for high-resolution generative modeling.
Authors: Ioan-Paul Ciobanu, Andrei-Iulian Hiji, Nicolae-Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu
Recent advances in audio generation led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested ``in the wild''. Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at this https URL.
Authors: Diogo Landau, Ingeborg de Pater, Mihaela Mitici, Nishant Saurabh
Complex systems such as aircraft engines are continuously monitored by sensors. In predictive aircraft maintenance, the collected sensor measurements are used to estimate the health condition and the Remaining Useful Life (RUL) of such systems. However, a major challenge when developing prognostics is the limited number of run-to-failure data samples. This challenge could be overcome if multiple airlines would share their run-to-failure data samples such that sufficient learning can be achieved. Due to privacy concerns, however, airlines are reluctant to share their data in a centralized setting. In this paper, a collaborative federated learning (FL) framework is therefore developed instead. Here, several airlines cooperate to train a collective RUL prognostic machine learning model, without the need to centrally share their data. For this, a decentralized validation procedure is proposed to validate the prognostics model without sharing any data. Moreover, sensor data is often noisy and of low quality. This paper therefore proposes four novel methods to aggregate the parameters of the global prognostic model. These methods enhance the robustness of the FL framework against noisy data. The proposed framework is illustrated for training a collaborative RUL prognostic model for aircraft engines, using the N-CMAPSS dataset. Here, six airlines collaborate in the FL framework to train a collective RUL prognostic model for their aircraft's engines. When comparing the proposed FL framework with the case where each airline independently develops their own prognostic model, the results show that FL leads to more accurate RUL prognostics for five out of the six airlines. Moreover, the novel robust aggregation methods render the FL framework robust to noisy data samples.
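The abstract does not name the four aggregation methods; for illustration only, two classical robust alternatives to plain federated averaging are sketched below:

```python
import numpy as np

def median_aggregate(client_params):
    """Coordinate-wise median across client (airline) model parameters."""
    return np.median(np.stack(client_params), axis=0)

def trimmed_mean_aggregate(client_params, trim=1):
    """Mean after dropping the `trim` largest and smallest values per
    coordinate, which suppresses clients trained on noisy data."""
    P = np.sort(np.stack(client_params), axis=0)
    return P[trim:len(client_params) - trim].mean(axis=0)

# e.g. with six airlines: trimmed_mean_aggregate([w1, w2, w3, w4, w5, w6])
```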
Authors: Yang Zhang, Karteekeya Sastry, Iyla Rossi, Joshua Olick-Gibson, Jonathan J. Russin, Charles Y. Liu, Lihong V. Wang
Noninvasive imaging deep into the adult brain at submillimeter and millisecond scales remains a challenge in medical imaging. Here, we report a helmet-based ultrasound brain imager built from a customized helmet, a scanned ultrasound array, and three-dimensional printing for real-time imaging of human brain anatomical and functional information. Through its application to post-hemicraniectomy patients in a sitting position, we achieved volumetric brain tissue structural, vascular, and blood flow images at centimeter-scale depths with submillimeter and millisecond spatiotemporal resolutions. We also demonstrated the system capability to track cerebral blood flow over repeated imaging sessions, including during motion-prone conditions. Our brain imager circumvents the skull and bridges the gap between high-resolution human brain imaging and wearable convenience. This imager may serve as a platform for further investigations into human brain dynamics in post-hemicraniectomy patients and offer insights into the brain that could surpass those obtained from non-human primate studies.
Authors: Zakir Hussain Shaik, Sai Subramanyam Thoota, Emil Björnson, Erik G. Larsson
We propose a novel resource-efficient over-the-air (OTA) computation framework to address the huge fronthaul computational and control overhead requirements in cell-free massive multiple-input multiple-output (MIMO) networks. We show that the global sufficient statistics to decode the data symbols can be computed OTA using the locally available information at the access points (APs). We provide the essential signal processing aspects at the APs and the central processing unit (CPU) to facilitate the OTA computation of sufficient statistics. The proposed framework scales effectively with an increase in the number of APs. We also make a comprehensive study of the benefits of an OTA framework compared to a conventional digital fronthaul in terms of the overhead associated in transferring the sufficient statistics from the APs to the CPU. To evaluate the performance of the OTA framework, we give closed-form expressions for the mean-square error (MSE) of the estimators of sufficient statistics and the overall data estimator. Furthermore, we assess the symbol error rate (SER) and bit error rate (BER) of the user equipment (UEs) data to demonstrate the efficacy of our method, and benchmark them against the state-of-the-art wired fronthaul networks.
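As a schematic of the idea, assuming ideal (noise-free) OTA superposition and illustrative dimensions, the CPU can recover the centralized sufficient statistic as the sum of per-AP matched-filter outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, K = 16, 4, 2          # APs, antennas per AP, UEs (illustrative sizes)
H = [rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))
     for _ in range(L)]     # per-AP channel estimates
s = rng.choice([-1, 1], K) + 1j * rng.choice([-1, 1], K)   # QPSK-like symbols
y = [H[l] @ s + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
     for l in range(L)]     # per-AP received signals

# Each AP forms its local statistic H_l^H y_l; with OTA computation, the
# analog fronthaul superposition delivers their sum to the CPU in one use
# of the channel instead of L separate digital transfers.
local = [H[l].conj().T @ y[l] for l in range(L)]
global_stat = np.sum(local, axis=0)              # sufficient statistic at CPU
gram = np.sum([H[l].conj().T @ H[l] for l in range(L)], axis=0)
s_hat = np.linalg.solve(gram, global_stat)       # centralized-equivalent estimate
```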
Authors: Dimitrios Bralios, Paris Smaragdis, Jonah Casebeer
Neural audio autoencoders create compact latent representations that preserve perceptually important information, serving as the foundation for both modern audio compression systems and generation approaches like next-token prediction and latent diffusion. Despite their prevalence, most audio processing operations, such as spatial and spectral up-sampling, still inefficiently operate on raw waveforms or spectral representations rather than directly on these compressed representations. We propose a framework that performs audio processing operations entirely within an autoencoder's latent space, eliminating the need to decode to raw audio formats. Our approach dramatically simplifies training by operating solely in the latent domain, with a latent L1 reconstruction term, augmented by a single latent adversarial discriminator. This contrasts sharply with raw-audio methods that typically require complex combinations of multi-scale losses and discriminators. Through experiments in bandwidth extension and mono-to-stereo up-mixing, we demonstrate computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio. This work establishes a more efficient paradigm for audio processing pipelines that already incorporate autoencoders, enabling significantly faster and more resource-efficient workflows across various audio tasks.
Authors: Anahita Jain, Husni Idris, John-Paul Clarke, Daniel Delahaye
We present an adaptive control scheme to enable the emergence of order within distributed, autonomous multi-agent systems. Past studies showed that under high-density conditions, order generated from traffic-following behavior reduces travel times, while under low densities, choosing direct paths is more beneficial. In this paper, we leveraged those findings to allow aircraft to independently and dynamically adjust their degree of traffic-following behavior based on the current state of the airspace. This enables aircraft to follow other traffic only when beneficial. Quantitative analyses revealed that dynamic traffic-following behavior results in lower aircraft travel times at the cost of minimal levels of additional disorder to the airspace. The sensitivity of these benefits to temporal and spatial horizons was also investigated. Overall, this work highlights the benefits, and potential necessity, of incorporating self-organizing behavior in making distributed, autonomous multi-agent systems scalable.
Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition~(ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as the Switchboard corpus. We will publicly release our models and training code.
Authors: Harveen Singh Chadha, Aswin Shanmugam Subramanian, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li
In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths (short, normal, and long) using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintains BLEU scores comparable to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving mean opinion score (MOS) gains of 0.34 for Spanish and 0.65 for Korean.
Authors: Jianglin Ding, Jingcheng Tang, Gangshan Jing
Action-dependent individual policies, which incorporate both environmental states and the actions of other agents in decision-making, have emerged as a promising paradigm for achieving global optimality in multi-agent reinforcement learning (MARL). However, the existing literature often adopts auto-regressive action-dependent policies, where each agent's policy depends on the actions of all preceding agents. This formulation incurs substantial computational complexity as the number of agents increases, thereby limiting scalability. In this work, we consider a more generalized class of action-dependent policies, which do not necessarily follow the auto-regressive form. We propose to use the `action dependency graph (ADG)' to model the inter-agent action dependencies. Within the context of MARL problems structured by coordination graphs, we prove that an action-dependent policy with a sparse ADG can achieve global optimality, provided the ADG satisfies specific conditions specified by the coordination graph. Building on this theoretical foundation, we develop a tabular policy iteration algorithm with guaranteed global optimality. Furthermore, we integrate our framework into several SOTA algorithms and conduct experiments in complex environments. The empirical results affirm the robustness and applicability of our approach in more general scenarios, underscoring its potential for broader MARL challenges.
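A minimal sketch of executing such a policy, with a toy ADG that is not the paper's construction: actions are sampled in a topological order of the graph, each agent conditioning only on its ADG parents rather than on all preceding agents as in the auto-regressive form.

```python
# Toy action dependency graph (must be a DAG): agent -> list of parent
# agents whose actions it observes before acting. Illustrative only.
adg = {0: [], 1: [0], 2: [0], 3: [1, 2]}

def topo_order(adg):
    """Return agents in an order where every parent precedes its children."""
    order, seen = [], set()
    def visit(v):
        if v in seen:
            return
        for p in adg[v]:
            visit(p)
        seen.add(v)
        order.append(v)
    for v in adg:
        visit(v)
    return order

def sample_joint_action(policies, state):
    """Sample a joint action; each policy sees the state and only the
    actions of its ADG parents (sparse, not auto-regressive)."""
    actions = {}
    for agent in topo_order(adg):
        parent_acts = tuple(actions[p] for p in adg[agent])
        actions[agent] = policies[agent](state, parent_acts)
    return actions

# e.g. policies = {i: (lambda s, pa: len(pa)) for i in adg}
#      sample_joint_action(policies, state=None)
```

Agents without a path between them in the ADG can act in parallel, which is the source of the scalability gain over the fully auto-regressive formulation.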
Authors: Yun-Feng Lo, Changmin Lee, Chan-Byoung Chae
Molecular communication (MC), one of the emerging techniques in the field of communication, is entering a new phase following several decades of foundational research. Recently, attention has shifted toward MC in liquid media, particularly within tubular environments, due to novel application scenarios. The spatial constraints of such environments make accurate modeling of molecular movement in tubes more challenging than in traditional free-space channels. In this paper, we propose a three-dimensional channel model for molecular communications with an absorbing ring-shaped receiver in a tubular environment. To the best of our knowledge, this is the first theoretical study to model the impact of an absorbing ring-shaped receiver on the channel response in tube-based MC systems. The problem is formulated as a partial differential equation with heterogeneous boundary conditions, and an approximate solution is derived under flow-dominated conditions. The accuracy of the proposed model is validated through particle-based simulations. We anticipate that the results of this study will contribute to the design of practical MC systems in real-world tubular environments.
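The abstract does not reproduce the model's formulation; under the standard advection-diffusion assumption with laminar (Poiseuille) flow, which such tube models typically start from, the governing equation would read

$$\frac{\partial C}{\partial t} = D\,\nabla^{2}C - v(r)\,\frac{\partial C}{\partial z}, \qquad v(r) = 2\bar{v}\left(1 - \frac{r^{2}}{R^{2}}\right),$$

where $C$ is the molecular concentration, $D$ the diffusion coefficient, $\bar{v}$ the mean flow velocity, and $R$ the tube radius; the heterogeneous boundary conditions impose $C = 0$ on the absorbing ring-shaped receiver and zero flux, $\partial C/\partial r = 0$, on the remaining tube wall.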
Authors: Nabarun Goswami, Tatsuya Harada
We propose a multi-stage framework for universal speech enhancement, designed for the Interspeech 2025 URGENT Challenge. Our system first employs a Sparse Compression Network to robustly separate sources and extract an initial clean speech estimate from noisy inputs. This is followed by an efficient generative model that refines speech quality by leveraging self-supervised features and optimizing a masked language modeling objective on acoustic tokens derived from a neural audio codec. In the final stage, a fusion network integrates the outputs of the first two stages with the original noisy signal, achieving a balanced improvement in both signal fidelity and perceptual quality. Additionally, a shift trick that aggregates multiple time-shifted predictions, along with output blending, further boosts performance. Experimental results on challenging multilingual datasets with variable sampling rates and diverse distortion types validate the effectiveness of our approach.
Authors: Kyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi
Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
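Mechanically, post-hoc editing of a pre-trained model's internal representations can be done with a forward hook; the sketch below assumes a PyTorch TTS model and a hypothetical pre-computed "prosody direction" vector (module path and file names are illustrative):

```python
import torch

def make_editor(direction, alpha):
    """Forward hook that shifts a layer's activations along a given
    direction at inference time, without any retraining."""
    def hook(module, inputs, output):
        # returning a tensor from a forward hook replaces the layer output
        return output + alpha * direction.to(output.device)
    return hook

# Illustrative usage with a hypothetical pre-trained module `tts.encoder.layer4`:
# direction = torch.load("pitch_direction.pt")   # hypothetical artifact
# handle = tts.encoder.layer4.register_forward_hook(make_editor(direction, 2.0))
# audio = tts.synthesize("hello world")          # synthesis with edited prosody
# handle.remove()                                # restore original behavior
```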
Authors: Jiali Cheng, Hadi Amiri
We introduce machine unlearning for speech tasks, a novel and underexplored research problem that aims to efficiently and effectively remove the influence of specific data from trained speech models without full retraining. This has important applications in privacy preservation, removal of outdated or noisy data, and bias mitigation. While machine unlearning has been studied in computer vision and natural language processing, its application to speech is largely unexplored due to the high-dimensional, sequential, and speaker-dependent nature of speech data. We define two fundamental speech unlearning tasks: sample unlearning, which removes individual data points (e.g., a voice recording), and class unlearning, which removes an entire category (e.g., all data from a speaker), while preserving performance on the remaining data. Experiments on keyword spotting and speaker identification demonstrate that unlearning speech data is significantly more challenging than unlearning image or text data. We conclude with key future directions in this area, including structured training, robust evaluation, feature-level unlearning, broader applications, scalable methods, and adversarial robustness.
Authors: Dena Mujtaba, Nihar Mahapatra
Stuttering -- characterized by involuntary disfluencies such as blocks, prolongations, and repetitions -- is often misinterpreted by automatic speech recognition (ASR) systems, resulting in elevated word error rates and making voice-driven technologies inaccessible to people who stutter. The variability of disfluencies across speakers and contexts further complicates ASR training, compounded by limited annotated stuttered speech data. In this paper, we investigate fine-tuning ASRs for stuttered speech, comparing generalized models (trained across multiple speakers) to personalized models tailored to individual speech characteristics. Using a diverse range of voice-AI scenarios, including virtual assistants and video interviews, we evaluate how personalization affects transcription accuracy. Our findings show that personalized ASRs significantly reduce word error rates, especially in spontaneous speech, highlighting the potential of tailored models for more inclusive voice technologies.
Authors: Leying Zhang, Yao Qian, Xiaofei Wang, Manthan Thakker, Dongmei Wang, Jianwei Yu, Haibin Wu, Yuxuan Hu, Jinyu Li, Yanmin Qian, Sheng Zhao
Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model, eliminating the reliance on intermediate token representations. To better capture realistic conversational dynamics, we propose transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking strategies. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed. Notably, CoVoMix2 operates without requiring transcriptions for the prompt and supports controllable dialogue generation, including overlapping speech and precise timing control, demonstrating strong generalizability to real-world speech generation scenarios.
Authors: Tianrui Pan, Jie Liu, Zewen Huang, Jie Tang, Gangshan Wu
To enhance immersive experiences, binaural audio offers spatial awareness of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object user-interactive environments. To address this, we propose a Text-guided Audio Spatialization (TAS) framework that utilizes flexible text prompts and evaluates our model from unified generation and comprehension perspectives. Due to the limited availability of premium and large-scale stereo data, we construct the SpatialTAS dataset, which encompasses 376,000 simulated binaural audio samples to facilitate the training of our model. Our model learns binaural differences guided by 3D spatial location and relative position prompts, augmented by flipped-channel audio. It outperforms existing methods on both simulated and real-recorded datasets, demonstrating superior generalization and accuracy. Besides, we develop an assessment model based on Llama-3.1-8B, which evaluates the spatial semantic coherence between our generated binaural audio and text prompts through a spatial reasoning task. Results demonstrate that text prompts provide flexible and interactive control to generate binaural audio with excellent quality and semantic consistency in spatial locations. The dataset is available at this https URL.
Authors: Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden
While audio foundation models perform well on a myriad of tasks, from sound classification to speech analysis, these models are trained and tested on dry, non-spatial, single-source audio clips. This limits their success in real-world situations and results in spatially unaware audio embeddings. To address these limitations, we propose a novel self-supervised training approach for General-Purpose, Real-world Audio Models (GRAMs). The GRAM training approach enables robust spatial audio representation learning for naturalistic, noisy sound scenes and can be applied to any masking-based deep learning model. We demonstrate the success of our approach by training two state-of-the-art models, one with a transformer and one with a mamba backbone. We assess the quality of the extracted audio representations from GRAMs using the original version of the HEAR benchmark, a newly synthesized, naturalistic version of the HEAR benchmark, and novel sound localization tasks based on HEAR benchmark datasets. The results show that our approach minimizes the performance gap between dry, non-spatial, single-source sound scenes and naturalistic sound scenes for crucial tasks such as auditory scene analysis, outperforming existing state-of-the-art audio foundation models at a fraction of the training steps. Moreover, GRAMs show state-of-the-art performance on sound localization tasks, exceeding even supervised sound localization models. In sum, the proposed approach represents a significant advancement towards robust audio foundation models for real-world applications with state-of-the-art performance on naturalistic sound scenes as well as spatial audio representation learning.
Authors: Haitao Li, Ziyu Li, Yiheng Mao, Ziyi Liu, Zhoujian Sun, Zhengxing Huang
The advent of multimodal large language models (MLLMs) has sparked interest in their application to electrocardiogram (ECG) analysis. However, existing ECG-focused MLLMs primarily focus on report generation tasks, often limited to single 12-lead, short-duration (10s) ECG inputs, thereby underutilizing the potential of MLLMs. To this end, we aim to develop an MLLM for ECG analysis that supports a broader range of tasks and more flexible ECG inputs. However, existing ECG-QA datasets are often monotonous. To address this gap, we first constructed the anyECG dataset, which encompasses a wide variety of tasks, including report generation, abnormal waveform localization, and open-ended question answering. In addition to standard hospital ECGs, we introduced long-duration reduced-lead ECGs for home environments and multiple ECG comparison scenarios commonly encountered in clinical practice. Furthermore, we propose the anyECG-chat model, which supports dynamic-length ECG inputs and multiple ECG inputs. We trained the model using a three-stage curriculum training recipe with the anyECG dataset. A comprehensive evaluation was conducted, demonstrating that anyECG-chat is capable of supporting various practical application scenarios, including not only common report generation tasks but also abnormal waveform localization for long-duration reduced-lead ECGs in home environments and comprehensive comparative analysis of multiple ECGs.
Authors: Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao
Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.
Authors: Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
Authors: Ming Meng, Ziyi Yang, Jian Yang, Zhenjie Su, Yonggui Zhu, Zhaoxin Fan
Recent advancements in text-to-speech (TTS) technology have increased demand for personalized audio synthesis. Zero-shot voice cloning, a specialized TTS task, aims to synthesize a target speaker's voice using only a single audio sample and arbitrary text, without prior exposure to the speaker during training. This process employs pattern recognition techniques to analyze and replicate the speaker's unique vocal features. Despite progress, challenges remain in adapting to the vocal style of unseen speakers, highlighting difficulties in generalizing TTS systems to handle diverse voices while maintaining naturalness, expressiveness, and speaker fidelity. To address the challenges of unseen speaker style adaptation, we propose DS-TTS, a novel approach aimed at enhancing the synthesis of diverse, previously unheard voices. Central to our method is a Dual-Style Encoding Network (DuSEN), where two distinct style encoders capture complementary aspects of a speaker's vocal identity. These speaker-specific style vectors are seamlessly integrated into the Dynamic Generator Network (DyGN) via a Style Gating-Film (SGF) mechanism, enabling more accurate and expressive reproduction of unseen speakers' unique vocal characteristics. In addition, we introduce a Dynamic Generator Network to tackle synthesis issues that arise with varying sentence lengths. By dynamically adapting to the length of the input, this component ensures robust performance across diverse text inputs and speaker styles, significantly improving the model's ability to generalize to unseen speakers in a more natural and expressive manner. Experimental evaluations on the VCTK dataset suggest that DS-TTS demonstrates superior overall performance in voice cloning tasks compared to existing state-of-the-art models, showing notable improvements in both word error rate and speaker similarity.
Authors: Shenghui Lu, Hukai Huang, Jinanglong Yao, Kaidi Wang, Qingyang Hong, Lin Li
This paper proposes a model that integrates sub-band processing and deep filtering to fully exploit information from the target time-frequency (TF) bin and its surrounding TF bins for single-channel speech enhancement. The sub-band module captures surrounding frequency bin information at the input, while the deep filtering module applies filtering at the output to both the target TF bin and its surrounding TF bins. To further improve the model performance, we decouple deep filtering into temporal and frequency components and introduce a two-stage framework, reducing the complexity of filter coefficient prediction at each stage. Additionally, we propose the TAConv module to strengthen convolutional feature extraction. Experimental results demonstrate that the proposed hierarchical deep filtering network (HDF-Net) effectively utilizes surrounding TF bin information and outperforms other advanced systems while using fewer resources.
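A minimal sketch of the deep filtering output stage, assuming complex STFT inputs and network-predicted filter taps over past time frames (the model also filters across neighboring frequency bins; that is omitted here for brevity):

```python
import torch

def deep_filter(spec, coeffs, order=3):
    """Apply per-TF-bin complex filters over `order` past time frames.

    spec:   (B, F, T) complex STFT of the noisy input
    coeffs: (B, F, T, order) complex filter taps predicted by the network
    """
    B, F, T = spec.shape
    zeros = torch.zeros(B, F, order - 1, dtype=spec.dtype)
    padded = torch.cat([zeros, spec], dim=2)                 # causal time padding
    frames = padded.unfold(dimension=2, size=order, step=1)  # (B, F, T, order)
    return (coeffs * frames).sum(dim=-1)                     # filtered TF bins

# e.g. deep_filter(torch.randn(1, 257, 100, dtype=torch.cfloat),
#                  torch.randn(1, 257, 100, 3, dtype=torch.cfloat))
```

The paper's two-stage decoupling would predict the temporal and frequency taps in separate stages, shrinking the coefficient space each stage must learn.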
Authors: Harsha Yelchuri, Diwakar Kumar Singh, Nithish Krishnabharathi Gnani, T V Prabhakar, Chandramani Singh
Robotic surgery imposes a significant cognitive burden on the surgeon. This cognitive burden increases in the case of remote robotic surgeries due to latency between entities and thus might affect the quality of surgery. Here, the patient side and the surgeon side are geographically separated by hundreds to thousands of kilometres. Real-time teleoperation of robots requires strict latency bounds for control and feedback. We propose a dual digital twin (DT) framework and explain the simulation environment and teleoperation framework. Here, the doctor visually controls the locally available DT of the patient side and thus experiences minimum latency. The second digital twin serves two purposes. Firstly, it provides a layer of safety for operator-related mishaps, and secondly, it conveys the coordinates of known and unknown objects back to the operator's side digital twin. We show that teleoperation accuracy and user experience are enhanced with our approach. Experimental results using the NASA-TLX metric show that the quality of surgery is vastly improved with DT, perhaps due to reduced cognitive burden. The network data rate for identifying objects at the operator side is 25x lower than normal.
Authors: Lorenzo Lagostina, Deborah Volpe, Maurizio Zamboni, Giovanna Turvani
This work presents AEQUAM (Area Efficient QUAntum eMulation), a toolchain that enables faster and more accessible quantum circuit verification. It consists of a compiler that translates OpenQASM 2.0 into RISC-like instructions, Cython software models for selecting number representations and simulating circuits, and a VHDL generator that produces RTL descriptions for FPGA-based hardware emulators. The architecture leverages a SIMD approach to parallelize computation and reduces complexity by exploiting the sparsity of quantum gate matrices. The VHDL generator allows customization of the number of emulated qubits and parallelization levels to meet user requirements. Synthesized on an Altera Cyclone 10LP FPGA with a 20-bit fixed-point representation and nearest-type approximation, the architecture demonstrates better scalability than other state-of-the-art emulators. Specifically, the emulator has been validated using the well-consolidated benchmarks of the MQT Bench framework.
Authors: Pengyu Ren, Wenhao Guan, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li
In recent years, diffusion-based generative models have demonstrated remarkable performance in speech conversion, including Denoising Diffusion Probabilistic Models (DDPM) and others. However, the advantages of these models come at the cost of requiring a large number of sampling steps. This limitation hinders their practical application in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity speech conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution to the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features by utilizing both content and pitch information, allowing speaker features to reflect the properties of the current speech more accurately. Experimental results show that ReFlow-VC performs exceptionally well in small datasets and zero-shot scenarios.
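Sampling from a rectified-flow model reduces to integrating a straight-path ODE from noise to data; a minimal Euler sampler, with `velocity_net` standing in for the trained conditional network (its interface here is an assumption):

```python
import torch

@torch.no_grad()
def reflow_sample(velocity_net, cond, shape, n_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 (Gaussian noise) to t=1
    (Mel-spectrogram) with Euler steps along the near-straight ODE path."""
    x = torch.randn(shape)                      # sample from the prior
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt)     # current time for the batch
        x = x + dt * velocity_net(x, t, cond)   # straight-path ODE update
    return x
```

Because rectified flow straightens the transport path, far fewer steps suffice than the hundreds typically needed by DDPM-style samplers, which is the efficiency the abstract targets.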
Authors: Zheng Zhao
Sequential Monte Carlo (SMC) methods have recently shown successful results for conditional sampling of generative diffusion models. In this paper we propose a new diffusion posterior SMC sampler achieving improved statistical efficiencies, particularly under outlier conditions or highly informative likelihoods. The key idea is to construct an observation path that correlates with the diffusion model and to design the sampler to leverage this correlation for more efficient sampling. Empirical results confirm the improved efficiency.
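For context, a generic sequential importance resampling step of the kind such samplers build on (the paper's correlated observation path enters through the potential; its specific construction is not reproduced here):

```python
import numpy as np

def smc_step(particles, log_weights, transition, log_potential, rng):
    """One SMC move: resample by weight, propagate each particle through
    the diffusion transition kernel, reweight by the observation-path
    potential. A generic sketch, not the paper's specific sampler."""
    w = np.exp(log_weights - log_weights.max())
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)   # resample
    particles = transition(particles[idx], rng)                  # propagate
    return particles, log_potential(particles)                   # reweight
```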
Authors: Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang
High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found in this https URL.
Authors: Anna Seo Gyeong Choi, Alexander Richardson, Ryan Partlan, Sunny Tang, Sunghye Cho
This study compares three acoustic feature extraction toolkits (OpenSMILE, Praat, and Librosa) applied to clinical speech data from individuals with schizophrenia spectrum disorders (SSD) and healthy controls (HC). By standardizing extraction parameters across the toolkits, we analyzed speech samples from 77 SSD and 87 HC participants and found significant toolkit-dependent variations. While F0 percentiles showed high cross-toolkit correlation (r=0.962 to 0.999), measures like F0 standard deviation and formant values often had poor, even negative, agreement. Additionally, correlation patterns differed between SSD and HC groups. Classification analysis identified F0 mean, HNR, and MFCC1 (AUC greater than 0.70) as promising discriminators. These findings underscore reproducibility concerns and advocate for standardized protocols, multi-toolkit cross-validation, and transparent reporting.
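To make the cross-toolkit comparison concrete, here is a minimal sketch of extracting F0 percentiles with two of the toolkits (Librosa, and Praat via the parselmouth package) and checking their agreement; the file name and pitch-range parameters are illustrative, not the study's standardized settings.

```python
# Sketch: F0 percentiles from two toolkits and their correlation.
import numpy as np
import librosa
import parselmouth  # Python interface to Praat

def f0_librosa(path, fmin=75.0, fmax=500.0):
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    return f0[~np.isnan(f0)]                     # keep voiced frames only

def f0_praat(path, fmin=75.0, fmax=500.0):
    pitch = parselmouth.Sound(path).to_pitch(pitch_floor=fmin,
                                             pitch_ceiling=fmax)
    f0 = pitch.selected_array["frequency"]
    return f0[f0 > 0]                            # 0 Hz marks unvoiced frames

qs = [5, 25, 50, 75, 95]
a = np.percentile(f0_librosa("sample.wav"), qs)
b = np.percentile(f0_praat("sample.wav"), qs)
print(np.corrcoef(a, b)[0, 1])                   # cross-toolkit agreement
```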
Authors: Asım Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani
The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts--showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility, we make the scripts and other resources available to the community.
Authors: Nhan Phan, Mikko Kuronen, Maria Kautonen, Riikka Ullakonoja, Anna von Zansen, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo
Mispronunciation detection (MD) models are the cornerstones of many language learning applications. Unfortunately, most systems are built for English and other major languages, while low-resourced language varieties, such as Finland Swedish (FS), lack such tools. In this paper, we introduce our MD model for FS, trained on 89 hours of first language (L1) speakers' spontaneous speech and tested on 33 minutes of L2 transcribed read-aloud speech. We trained a multilingual wav2vec 2.0 model with entropy regularization, followed by temperature scaling and top-k normalization after the inference to better adapt it for MD. The main novelty of our method lies in its simplicity, requiring minimal L2 data. The process is also language-independent, making it suitable for other low-resource languages. Our proposed algorithm allows us to balance Recall (43.2%) and Precision (29.8%), compared with the baseline model's Recall (77.5%) and Precision (17.6%).
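The temperature scaling and top-k normalization applied after inference are generic post-processing steps on the frame-level posteriors. A minimal sketch, with illustrative hyperparameters (the paper's chosen temperature and k may differ):

```python
# Sketch: temperature scaling followed by top-k renormalization of logits.
import numpy as np

def calibrate(logits, temperature=1.5, k=5):
    z = logits / temperature                          # soften the posteriors
    p = np.exp(z - z.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    topk = np.argsort(p, axis=-1)[..., -k:]           # top-k class indices
    mask = np.zeros_like(p)
    np.put_along_axis(mask, topk, 1.0, axis=-1)
    p *= mask                                         # zero out the tail
    return p / p.sum(axis=-1, keepdims=True)          # renormalize
```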
Authors: Bryan Van Scoy, Laurent Lessard
We consider iterative gradient-based optimization algorithms applied to functions that are smooth and strongly convex. The fastest globally convergent algorithm for this class of functions is the Triple Momentum (TM) method. We show that if the objective function is also twice continuously differentiable, a new, faster algorithm emerges, which we call $C^2$-Momentum (C2M). We prove that C2M is globally convergent and that its worst-case convergence rate is strictly faster than that of TM, with no additional computational cost. We validate our theoretical findings with numerical examples, demonstrating that C2M outperforms TM when the objective function is twice continuously differentiable.
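For reference, the TM baseline that C2M is measured against has a compact closed form. The sketch below implements the standard Triple Momentum iteration on a toy strongly convex quadratic; C2M's faster parameter choices are given in the paper and not reproduced here.

```python
# Sketch: Triple Momentum (Van Scoy, Freeman, Lynch) for m-strongly
# convex, L-smooth f; all iterate sequences converge to the minimizer.
import numpy as np

def triple_momentum(grad, x0, m, L, n_iters=200):
    rho = 1.0 - 1.0 / np.sqrt(L / m)          # worst-case convergence rate
    alpha = (1 + rho) / L
    beta = rho**2 / (2 - rho)
    gamma = rho**2 / ((1 + rho) * (2 - rho))
    xi_prev, xi = x0.copy(), x0.copy()
    for _ in range(n_iters):
        y = (1 + gamma) * xi - gamma * xi_prev
        xi_prev, xi = xi, (1 + beta) * xi - beta * xi_prev - alpha * grad(y)
    return xi

# Example: f(x) = 0.5 x'Ax with eigenvalues in [1, 10]
A = np.diag([1.0, 10.0])
print(triple_momentum(lambda x: A @ x, np.array([5.0, 5.0]), m=1.0, L=10.0))
```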
Authors: Will James
This project is the first of several experiments in composing music that changes in response to biosignals. The system is dubbed "Iola Walker" in reference to a common polyrhythm, the hemiola. A listener goes for a walk, and the Iola Walker app detects their walking pace. Iola Walker picks up footfalls using a foot-mounted accelerometer, processing the signals in real time with a recurrent neural network in an Android app. The Android app outputs a MIDI event for each footfall. The Iola Walker player, which might be a VST running in a DAW, plays the version of the next music passage whose underlying polyrhythms are closest to the listener's walking pace. This paper documents the process of training the model to detect footfalls in real time. The model is trained on accelerometer data from an Mbient Labs foot-mounted IMU at 200 Hz, with the ground truth for footfalls annotated by pressing the volume-up button on the Android device when the foot hits the ground. To collect training data, I walked around my neighborhood, clicking the volume-up button each time my foot hit the ground. Several methods were tried for detecting footfalls in real time from sensor data, including ones based on digital signal processing and traditional machine learning techniques.
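One of the "digital signal processing techniques" mentioned can be as simple as band-pass filtering the accelerometer magnitude and picking peaks with a refractory period. A minimal sketch, with assumed thresholds rather than the project's tuned values:

```python
# Sketch: DSP footfall detector for a 200 Hz foot-mounted IMU stream.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

FS = 200  # IMU sample rate (Hz)

def detect_footfalls(acc_xyz, lo=0.5, hi=5.0, min_gap_s=0.3):
    mag = np.linalg.norm(acc_xyz, axis=1)            # orientation-free signal
    b, a = butter(2, [lo / (FS / 2), hi / (FS / 2)], btype="band")
    filt = filtfilt(b, a, mag)                       # keep gait-band energy
    peaks, _ = find_peaks(filt, height=filt.std(),   # adaptive threshold
                          distance=int(min_gap_s * FS))
    return peaks / FS                                # footfall times (s)
```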
Authors: Ning Zhang, Henry Kenlay, Li Zhang, Mihai Cucuringu, Xiaowen Dong
Graph convolutional neural networks (GCNNs) have emerged as powerful tools for analyzing graph-structured data, achieving remarkable success across diverse applications. However, the theoretical understanding of the stability of these models, i.e., their sensitivity to small changes in the graph structure, remains in rather limited settings, hampering the development and deployment of robust and trustworthy models in practice. To fill this gap, we study how perturbations in the graph topology affect GCNN outputs and propose a novel formulation for analyzing model stability. Unlike prior studies that focus only on worst-case perturbations, our distribution-aware formulation characterizes output perturbations across a broad range of input data. This way, our framework enables, for the first time, a probabilistic perspective on the interplay between the statistical properties of the node data and perturbations in the graph topology. We conduct extensive experiments to validate our theoretical findings and demonstrate their benefits over existing baselines, in terms of both representation stability and adversarial attacks on downstream tasks. Our results demonstrate the practical significance of the proposed formulation and highlight the importance of incorporating data distribution into stability analysis.
Authors: Yu Nakagome, Michael Hentschel
Despite recent advances in end-to-end speech recognition methods, the output tends to be biased to the training data's vocabulary, resulting in inaccurate recognition of proper nouns and other unknown terms. To address this issue, we propose a method to improve recognition accuracy of such rare words in CTC-based models without additional training or text-to-speech systems. Specifically, keyword spotting is performed using acoustic features of intermediate layers during inference, and a bias is applied to the subsequent layers of the acoustic model for detected keywords. For keyword detection, we adopt a wildcard CTC that is both fast and tolerant of ambiguous matches, allowing flexible handling of words that are difficult to match strictly. Since this method does not require retraining of existing models, it can be easily applied to even large-scale models. In experiments on Japanese speech recognition, the proposed method achieved a 29% improvement in the F1 score for unknown words.
Authors: Xingjian Diao, Tianzhen Yang, Chunhui Zhang, Weiyi Wu, Ming Cheng, Jiang Gui
Music performances, characterized by dense and continuous audio as well as seamless audio-visual integration, present unique challenges for multimodal scene understanding and reasoning. Recent Music Performance Audio-Visual Question Answering (Music AVQA) datasets have been proposed to reflect these challenges, highlighting the continued need for more effective integration of audio-visual representations in complex question answering. However, existing Music AVQA methods often rely on dense and unoptimized representations, leading to inefficiencies in the isolation of key information, the reduction of redundancy, and the prioritization of critical samples. To address these challenges, we introduce Sparsify, a sparse learning framework specifically designed for Music AVQA. It integrates three sparsification strategies into an end-to-end pipeline and achieves state-of-the-art performance on the Music AVQA datasets. In addition, it reduces training time by 28.32% compared to its fully trained dense counterpart while maintaining accuracy, demonstrating clear efficiency gains. To further improve data efficiency, we propose a key-subset selection algorithm that selects and uses approximately 25% of MUSIC-AVQA v2.0 training data and retains 70-80% of full-data performance across models.
Authors: Thi Vu, Linh The Nguyen, Dat Quoc Nguyen
This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.
Authors: Haoyu Li, Xiangru Zhong, Bin Hu, Huan Zhang
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimations of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize the controller and Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training data sampling strategy and a domain updating mechanism that significantly reduce the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend the state-of-the-art neural network verifier $\alpha,\!\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield regions of attraction with volumes $5$ to $1.5\cdot 10^{5}$ times larger than the baselines, and our verification on continuous systems can be up to $40$ to $10000$ times faster than the traditional SMT solver dReal. Our code is available at this https URL.
Authors: Xue Xian Zheng, Weihang Liu, Xin Lou, Stefan Vlaski, Tareq Al-Naffouri
This paper introduces an innovative error feedback framework designed to mitigate quantization noise in distributed graph filtering, where communications are constrained to quantized messages. It draws on error spectrum shaping techniques from state-space digital filters, thereby establishing connections between quantized filtering processes over different domains. In contrast to existing error compensation methods, our framework quantitatively feeds back the quantization noise for exact compensation. We examine the framework under three key scenarios: (i) deterministic graph filtering, (ii) graph filtering over random graphs, and (iii) graph filtering with random node-asynchronous updates. Rigorous theoretical analysis demonstrates that the proposed framework significantly reduces the effect of quantization noise, and we provide closed-form solutions for the optimal error feedback coefficients. Moreover, this quantitative error feedback mechanism can be seamlessly integrated into communication-efficient decentralized optimization frameworks, enabling lower error floors. Numerical experiments validate the theoretical results, consistently showing that our method outperforms conventional quantization strategies in terms of both accuracy and robustness.
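The central mechanism, feeding the quantization error back so that later exchanges compensate for it, can be sketched for a simple polynomial graph filter. The feedback coefficient c below is an illustrative stand-in for the paper's optimized closed-form coefficients.

```python
# Sketch: distributed polynomial graph filter y = sum_k h[k] S^k x with
# quantized exchanges and scalar error feedback (illustrative, not the
# paper's optimized design).
import numpy as np

def quantize(x, step=0.05):
    return step * np.round(x / step)       # uniform mid-tread quantizer

def filter_with_error_feedback(S, h, x, c=1.0, step=0.05):
    y = h[0] * x
    z, e = x, np.zeros_like(x)
    for hk in h[1:]:
        s = z + c * e                      # re-inject the past error
        q = quantize(s, step)              # what actually gets transmitted
        e = s - q                          # new quantization error
        z = S @ q                          # one communication round
        y = y + hk * z
    return y
```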
Authors: Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Satoshi Asakawa
This paper reports on the development of a large-scale speech recognition model, Whale. Similar to models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Whale's architecture integrates a w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data drawn not only from public corpora but also from in-house data, thereby enhancing the model's robustness to different speaking styles and acoustic conditions. Through evaluations on multiple benchmarks, Whale achieved comparable performance to existing models. In particular, it achieves a word error rate of 2.4% on the Librispeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.
Authors: Yu-Fei Shi, Yang Ai, Zhen-Hua Ling
To compare the performance of two speech generation systems, one of the most effective approaches is estimating the preference score between their generated speech. This paper proposes a novel universal preference-score-based pairwise speech quality assessment (UPPSQA) model, aimed at predicting the preference score between paired speech samples to determine which one has better quality. The model first predicts the absolute mean opinion score (MOS) for the two speech samples separately, and then aggregates them into a relative preference score using a preference function. To address the scarcity of preference data, we also construct a new pairwise speech dataset based on a MOS dataset for experiments. Experimental results confirm that, whether in training scenarios with different data types and label conditions, or in both in-domain and out-of-domain test scenarios, the prediction accuracy of UPPSQA outperforms that of the baseline models, demonstrating its universality.
Authors: Tanel Alumäe, Artem Fedorchenko
This paper describes the language identification and multilingual speech recognition system developed at Tallinn University of Technology for the Interspeech 2025 ML-SUPERB 2.0 Challenge. A hybrid language identification system is used, consisting of a pretrained language embedding model and a light-weight speech recognition model with a shared encoder across languages and language-specific bigram language models. For speech recognition, three models are used, where only a single model is applied for each language, depending on the training data availability and performance on held-out data. The model set consists of a finetuned version of SeamlessM4T, MMS-1B-all with custom language adapters and MMS-zeroshot. The system obtained the top overall score in the challenge.
Authors: Seungu Han, Sungho Lee, Juheon Lee, Kyogu Lee
Deep generative models have recently been employed for speech enhancement to generate perceptually valid clean speech on large-scale datasets. Several diffusion models have been proposed, and more recently, a tractable Schrödinger Bridge has been introduced to transport between the clean and noisy speech distributions. However, these models often suffer from an iterative reverse process and require a large number of sampling steps -- more than 50. Our investigation reveals that the performance of baseline models significantly degrades when the number of sampling steps is reduced, particularly under low-SNR conditions. We propose integrating Schrödinger Bridge with GANs to effectively mitigate this issue, achieving high-quality outputs on full-band datasets while substantially reducing the required sampling steps. Experimental results demonstrate that our proposed model outperforms existing baselines, even with a single inference step, in both denoising and dereverberation tasks.
Authors: Jose Manoel Balthazar, Jorge Luis Palacios Felix, Mauricio A. Ribeiro, Angelo Marcelo Tusset, Jeferson Jose de Lima, Vinicius Piccirillo, Julijana Simonovic, Nikola D. Nevsic, Marcos Varanis, Clivaldo de Oliveira, Raphaela C. Machado, Gabriella O M Oliveira
This paper discusses a topic of current importance and future relevance in engineering: vibrating systems whose excitation sources have limited power and limited inertia, and whose frequencies vary with the instantaneous state of the vibrating system. Practical examples of such non-ideal systems are considered, the phenomena most characteristic of this kind of system are discussed, and the specific properties of various models are examined. The authors revisit publications from 2020 to 2025 that are based on the assumption that the external excitations are produced by non-ideal sources (RNIS), that is, sources with limited power supply. Among these applications, nonlinear phenomena such as the Sommerfeld effect and the saturation phenomenon were observed, considering fractional damping. Energy harvesters and the Jacobi-Anger expansion were used in the governing equations of motion; the Jacobi-Anger expansion was also applied to energy transfer between vibrating modes under an external force with time-varying frequency, which represents one of the future directions of research on non-ideal vibrating systems. Directions for future investigations are provided.
Authors: Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang
Stage lighting plays an essential role in live music performances, influencing the engaging experience of both musicians and audiences. Given the high costs associated with hiring or training professional lighting engineers, Automatic Stage Lighting Control (ASLC) has gained increasing attention. However, most existing approaches only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this issue, this paper presents an end-to-end solution that directly learns from experienced lighting engineers -- Skip-BART. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method modifies the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the framework. We validate our method through both quantitative analysis and a human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting engineers. Moreover, our method yields a p-value of 0.72 in a statistical comparison based on human evaluations with human lighting engineers, suggesting that the proposed approach closely matches human lighting engineering performance. To support further research, we have made our self-collected dataset, code, and trained model parameters available at this https URL.
Authors: Guitao Wang, Jinming Zhao, Hao Yang, Guilin Qi, Tongtong Wu, Gholamreza Haffari
Rapid growth in speech data demands adaptive models, as traditional static methods fail to keep pace with dynamic and diverse speech information. We introduce continuous speech learning, a new setup aimed at bridging the adaptation gap in current speech models. We use the encoder-decoder Whisper model to standardize speech tasks into a generative format. We integrate a learnable gated-fusion layer on top of the encoder to dynamically select task-specific features for downstream tasks. Our approach improves accuracy significantly over traditional methods across six speech processing tasks, demonstrating gains in adapting to new speech tasks without full retraining.
Authors: Satvik Dixit, Sungjoon Park, Chris Donahue, Laurie M. Heller
Temporal envelope morphing, the process of interpolating between the amplitude dynamics of two audio signals, is an emerging problem in generative audio systems that lacks sufficient perceptual grounding. Morphing of temporal envelopes in a perceptually intuitive manner should enable new methods for sound blending in creative media and for probing perceptual organization in psychoacoustics. However, existing audio morphing techniques often fail to produce intermediate temporal envelopes when input sounds have distinct temporal structures; many morphers effectively overlay both temporal structures, leading to perceptually unnatural results. In this paper, we introduce a novel workflow for learning envelope morphing with perceptual guidance: we first derive perceptually grounded morphing principles through human listening studies, then synthesize large-scale datasets encoding these principles, and finally train machine learning models to create perceptually intermediate morphs. Specifically, we present: (1) perceptual principles that guide envelope morphing, derived from our listening studies, (2) a supervised framework to learn these principles, (3) an autoencoder that learns to compress temporal envelope structures into latent representations, and (4) benchmarks for evaluating audio envelope morphs, using both synthetic and naturalistic data, and show that our approach outperforms existing methods in producing temporally intermediate morphs. All code, models, and datasets will be made publicly available upon publication.
Authors: Hemanth Sabbella, Archit Mukherjee, Thivya Kandappu, Sounak Dey, Arpan Pal, Archan Misra, Dong Ma
Spiking neural networks (SNNs) have emerged as a class of bio-inspired networks that leverage sparse, event-driven signaling to achieve low-power computation while inherently modeling temporal dynamics. Such characteristics align closely with the demands of ubiquitous computing systems, which often operate on resource-constrained devices while continuously monitoring and processing time-series sensor data. Despite their unique and promising features, SNNs have received limited attention and remain underexplored (or at least, under-adopted) within the ubiquitous computing community. To address this gap, this paper first introduces the core components of SNNs, both in terms of models and training mechanisms. It then presents a systematic survey of 76 SNN-based studies focused on time-series data analysis, categorizing them into six key application domains. For each domain, we summarize relevant works and subsequent advancements, distill core insights, and highlight key takeaways for researchers and practitioners. To facilitate hands-on experimentation, we also provide a comprehensive review of current software frameworks and neuromorphic hardware platforms, detailing their capabilities and specifications, and then offering tailored recommendations for selecting development tools based on specific application needs. Finally, we identify prevailing challenges within each application domain and propose future research directions that need to be explored by the ubiquitous computing community. Our survey highlights the transformative potential of SNNs in enabling energy-efficient ubiquitous sensing across diverse application domains, while also serving as an essential introduction for researchers looking to enter this emerging field.
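As a concrete instance of the "core components" the survey introduces, most SNN models are built from leaky integrate-and-fire (LIF) neurons. A minimal sketch with illustrative constants:

```python
# Sketch: a single leaky integrate-and-fire (LIF) neuron.
import numpy as np

def lif(inputs, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """Integrate an input-current sequence into a binary spike train."""
    v, spikes = v_reset, []
    for i in inputs:
        v += (dt / tau) * (-(v - v_reset) + i)   # leaky integration
        fired = v >= v_th
        spikes.append(int(fired))
        if fired:
            v = v_reset                          # reset after a spike
    return np.array(spikes)
```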
Authors: Ismaila Salihou Adamou, Elsa Dupraz, Reza Asvadi, Tad Matsumoto
This paper addresses the design of practical short-length coding schemes for Distributed Hypothesis Testing (DHT). While most prior work on DHT has focused on information-theoretic analyses, deriving bounds on Type-II error exponents via achievability schemes based on quantization and quantize-binning, the practical implementation of DHT coding schemes has remained largely unexplored. Moreover, existing practical coding solutions for quantization and quantize-binning approaches were developed for source reconstruction tasks with very long code lengths, and they are not directly applicable to DHT. In this context, this paper introduces efficient short-length implementations of quantization and quantize-binning schemes for DHT, constructed from short binary linear block codes. Numerical results show the efficiency of the proposed coding schemes compared to uncoded cases and to existing schemes initially developed for data reconstruction. In addition to practical code design, the paper derives exact analytical expressions for the Type-I and Type-II error probabilities associated with each proposed scheme. These analytical expressions are shown to accurately predict the practical performance measured from Monte-Carlo simulations of the proposed schemes. The theoretical results are novel and offer a useful framework for optimizing and comparing practical DHT schemes across a wide range of source and code parameters.
Authors: Youwei Yu, Junhong Xu, Lantao Liu
Model-free reinforcement learning has emerged as a powerful method for developing robust robot control policies capable of navigating through complex and unstructured environments. The effectiveness of these methods hinges on two essential elements: (1) the use of massively parallel physics simulations to expedite policy training, and (2) an environment generator tasked with crafting sufficiently challenging yet attainable environments to facilitate continuous policy improvement. Existing methods of outdoor environment generation often rely on heuristics constrained by a set of parameters, limiting their diversity and realism. In this work, we introduce ADEPT, a novel \textbf{A}daptive \textbf{D}iffusion \textbf{E}nvironment for \textbf{P}olicy \textbf{T}ransfer in a zero-shot sim-to-real fashion that leverages Denoising Diffusion Probabilistic Models to dynamically expand existing training environments by adding more diverse and complex environments adaptive to the current policy. ADEPT guides the diffusion model's generation process through initial noise optimization, blending noise-corrupted environments from existing training environments weighted by the policy's performance in each corresponding environment. By manipulating the noise corruption level, ADEPT seamlessly transitions between generating similar environments for policy fine-tuning and novel ones to expand training diversity. To benchmark ADEPT in off-road navigation, we propose a fast and effective multi-layer map representation for wild environment generation. Our experiments show that the policy trained with ADEPT outperforms policies trained on both procedurally generated and natural environments, as well as popular navigation methods.
Authors: Simon Mylius
All of the frontier AI companies have published safety frameworks in which they define capability thresholds and risk mitigations that determine how they will safely develop and deploy their models. Adoption of systematic approaches to risk modelling, based on established practices used in safety-critical industries, has been recommended; however, frontier AI companies currently do not describe in detail any structured approach to identifying and analysing hazards. STPA (Systems-Theoretic Process Analysis) is a systematic methodology for identifying how complex systems can become unsafe, leading to hazards. It achieves this by mapping out controllers and controlled processes, then analysing their interactions and feedback loops to understand how harmful outcomes could occur (Leveson & Thomas, 2018). We evaluate STPA's ability to broaden the scope, improve traceability, and strengthen the robustness of safety assurance for frontier AI systems. Applying STPA to the threat model and scenario described in 'A Sketch of an AI Control Safety Case' (Korbak et al., 2025), we derive a list of Unsafe Control Actions. From these we select a subset and explore the Loss Scenarios that lead to them if left unmitigated. We find that STPA is able to identify causal factors that may be missed by unstructured hazard analysis methodologies, thereby improving robustness. We suggest STPA could increase the safety assurance of frontier AI when used to complement or check coverage of existing AI governance techniques, including capability thresholds, model evaluations, and emergency procedures. The application of a systematic methodology supports scalability by increasing the proportion of the analysis that could be conducted by LLMs, reducing the burden on human domain experts.
Authors: Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at this https URL.
Authors: Haruki Yokota, Hiroshi Higashi, Yuichi Tanaka, Gene Cheung
Signed graphs are equipped with both positive and negative edge weights, encoding pairwise correlations as well as anti-correlations in data. A balanced signed graph is a signed graph with no cycles containing an odd number of negative edges. The Laplacian of a balanced signed graph has eigenvectors that map via a simple linear transform to those of a corresponding positive graph Laplacian, thus enabling reuse of spectral filtering tools designed for positive graphs. We propose an efficient method to learn a balanced signed graph Laplacian directly from data. Specifically, extending a previous linear programming (LP) based sparse inverse covariance estimation method called CLIME, we formulate a new LP problem for each Laplacian column $i$, where the linear constraints restrict the weight signs of edges stemming from node $i$, so that nodes of the same / different polarities are connected by positive / negative edges. Towards optimal model selection, we derive a suitable CLIME parameter $\rho$ based on a combination of the Hannan-Quinn information criterion and a minimum feasibility criterion. We solve the LP problem efficiently by tailoring a sparse LP method based on ADMM. We theoretically prove local solution convergence of our proposed iterative algorithm. Extensive experimental results on synthetic and real-world datasets show that our balanced graph learning method outperforms competing methods and enables reuse of spectral filters, wavelets, and graph convolutional nets (GCN) constructed for positive graphs.
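The per-column LP at the heart of CLIME is easy to state: minimize the $\ell_1$ norm of a column subject to an $\ell_\infty$ constraint on the covariance fit. A minimal sketch of the plain (sign-unconstrained) version using scipy; the paper's sign restrictions on edge weights and its ADMM-tailored sparse LP solver are omitted.

```python
# Sketch: one CLIME column, min ||b||_1 s.t. ||S b - e_i||_inf <= rho,
# via the split b = u - v with u, v >= 0.
import numpy as np
from scipy.optimize import linprog

def clime_column(S, i, rho):
    n = S.shape[0]
    e = np.zeros(n); e[i] = 1.0
    c = np.ones(2 * n)                        # objective: sum(u) + sum(v)
    A = np.vstack([np.hstack([S, -S]),        #  S(u - v) - e <= rho
                   np.hstack([-S, S])])       # -S(u - v) + e <= rho
    b = np.concatenate([rho + e, rho - e])
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v
```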
Authors: Mattson Ogg, Rahul Hingorani, Diego Luna, Griffin W. Milsap, William G. Coon, Clara A. Scholl
Brain computer interface (BCI) research, as well as increasing portions of the field of neuroscience, have found success deploying large-scale artificial intelligence (AI) pre-training methods in conjunction with vast public repositories of data. This approach of pre-training foundation models using label-free, self-supervised objectives offers the potential to learn robust representations of neurophysiology, potentially addressing longstanding challenges in neural decoding. However, to date, much of this work has focused explicitly on standard BCI benchmarks and tasks, which likely overlooks the multitude of features these powerful methods might learn about brain function as well as other electrophysiological information. We introduce a new method for self-supervised BCI foundation model pre-training for EEG inspired by a transformer-based approach adapted from the HuBERT framework originally developed for speech processing. Our pipeline is specifically focused on low-profile, real-time usage, involving minimally pre-processed data and just eight EEG channels on the scalp. We show that our foundation model learned a representation of EEG that supports standard BCI tasks (P300, motor imagery), but also that this model learns features of neural data related to individual variability, and other salient electrophysiological components (e.g., alpha rhythms). In addition to describing and evaluating a novel approach to pre-training BCI models and neural decoding, this work opens the aperture for what kind of tasks and use-cases might exist for neural data in concert with powerful AI methods.
Authors: Leonardo J. Colombo, Manuel de León, María Emma Eyrea Irazú, Asier López-Gordón
This paper discusses reduction by symmetries for autonomous and non-autonomous forced mechanical systems with inelastic collisions. In particular, we introduce the notion of generalized hybrid momentum map and hybrid constants of the motion to give general conditions on whether it is possible to perform symmetry reduction for Hamiltonian and Lagrangian systems subject to non-conservative external forces and non-elastic impacts, as well as its extension to time-dependent mechanical systems subject to time-dependent external forces and time-dependent inelastic collisions. We illustrate the applicability of the method with examples and numerical simulations.
Authors: Julien Cornebise, Ivan Oršolić, Freddie Kalaitzis
Analyzing the planet at scale with satellite imagery and machine learning is a dream that has been constantly hindered by the cost of difficult-to-access, highly-representative, high-resolution imagery. To remediate this, we introduce the WorldStrat dataset, the largest and most varied such publicly available dataset, featuring Airbus SPOT 6/7 imagery at a high resolution of up to 1.5 m/pixel. Empowered by the European Space Agency's Phi-Lab as part of the ESA-funded QueryPlanet project, we curate nearly 10,000 km² of unique locations to ensure stratified representation of all types of land use across the world: from agriculture to ice caps, from forests to multiple urbanization densities. We also enrich those with locations typically under-represented in ML datasets: sites of humanitarian interest, illegal mining sites, and settlements of persons at risk. We temporally match each high-resolution image with multiple low-resolution images from the freely accessible lower-resolution Sentinel-2 satellites at 10 m/pixel. We accompany this dataset with an open-source Python package to rebuild or extend the WorldStrat dataset, train and infer baseline algorithms, and learn with abundant tutorials, all compatible with the popular EO-learn toolbox. We hereby hope to foster broad-spectrum applications of ML to satellite imagery, and possibly to develop, from free public low-resolution Sentinel-2 imagery, the same power of analysis allowed by costly private high-resolution imagery. We illustrate this specific point by training and releasing several highly compute-efficient baselines on the task of Multi-Frame Super-Resolution. High-resolution Airbus imagery is CC BY-NC, while the labels and Sentinel-2 imagery are CC BY, and the source code and pre-trained models are released under BSD. The dataset is available at this https URL and the software package at this https URL.
Authors: Xiaochun Ge, Shanping Yu, Wenqian Shen, Chengwen Xing, Byonghyo Shim
Beamforming design with partial channel estimation and feedback for frequency-division duplexing (FDD) reconfigurable intelligent surface (RIS) assisted systems is considered in this paper. We leverage the observation that path angle information (PAI) varies more slowly than path gain information (PGI). Then, several dominant paths are selected among all the cascaded paths according to the known PAI to maximize the spectral efficiency of downlink data transmission. To acquire the dominant path gain information (DPGI, i.e., the path gains of the selected dominant paths) at the base station (BS), we propose a DPGI estimation and feedback scheme based on joint beamforming design at the BS and RIS. Both the required number of downlink pilot signals and the length of the uplink feedback vector are reduced to the number of dominant paths, thus achieving a great reduction of both pilot and feedback overhead. Furthermore, we optimize the active BS beamformer and passive RIS beamformer by exploiting the fed-back DPGI to further improve the spectral efficiency. Numerical results demonstrate the superiority of our proposed algorithms over conventional schemes.
Authors: Haz Sameen Shahgir, Tanjeem Azwad Zaman, Khondker Salman Sayeed, Md. Asif Haider, Sheikh Saifur Rahman Jony, M. Sohel Rahman
Optical Coherence Tomography (OCT) scans yield all possible cross-section images of a retina for detecting biomarkers linked to optical defects. Due to the high volume of data generated, an automated and reliable biomarker detection pipeline is necessary as a primary screening stage. We outline our new state-of-the-art pipeline for identifying biomarkers from OCT scans. In collaboration with trained ophthalmologists, we identify local and global structures in biomarkers. Through a comprehensive and systematic review of existing vision architectures, we evaluate different convolution and attention mechanisms for biomarker detection. We find that MaxViT, a hybrid vision transformer combining convolution layers with strided attention, is better suited for local feature detection, while EVA-02, a standard vision transformer leveraging pure attention and large-scale knowledge distillation, excels at capturing global features. We ensemble the predictions of both models to achieve first place in the IEEE Video and Image Processing Cup 2023 competition on OCT biomarker detection, achieving a patient-wise F1 score of 0.8527 in the final phase of the competition, scoring 3.8% higher than the next best solution. Finally, we used knowledge distillation to train a single MaxViT to outperform our ensemble at a fraction of the computation cost.
Authors: Gennaro Notomista, Mario Selvaggio, Francesca Pagano, María Santos, Siddharth Mayya, Vincenzo Lippiello, Cristian Secchi
The ability of executing multiple tasks simultaneously is an important feature of redundant robotic systems. As a matter of fact, complex behaviors can often be obtained as a result of the execution of several tasks. Moreover, in safety-critical applications, tasks designed to ensure the safety of the robot and its surroundings have to be executed along with other nominal tasks. In such cases, it is also important to prioritize the former over the latter. In this paper, we formalize the definition of extended set-based tasks, i.e., tasks which can be executed by rendering subsets of the task space asymptotically stable or forward invariant using control barrier functions. We propose a formal mathematical representation of such tasks that allows for the execution of more complex and time-varying prioritized stacks of tasks using kinematic and dynamic robot models alike. We present an optimization-based framework which is computationally efficient, accounts for input bounds, and allows for the stable execution of time-varying prioritized stacks of extended set-based tasks. The proposed framework is validated using extensive simulations, quantitative comparisons to the state-of-the-art hierarchical quadratic programming, and experiments with robotic manipulators.
Authors: Danwei Cai, Zexin Cai, Ze Li, Ming Li
Speaker representation learning is crucial for voice recognition systems, with recent advances in self-supervised approaches reducing dependency on labeled data. Current two-stage iterative frameworks, while effective, suffer from significant computational overhead due to repeated rounds of clustering and training. They also struggle with noisy pseudo labels that can impair model learning. This paper introduces self-supervised reflective learning (SSRL), an improved framework that addresses these limitations by enabling continuous refinement of pseudo labels during training. Through a teacher-student architecture and online clustering mechanism, SSRL eliminates the need for iterative training rounds. To handle label noise, we incorporate noisy label modeling and pseudo label queues that maintain temporal consistency. Experiments on VoxCeleb show SSRL's superiority over current two-stage iterative approaches, surpassing the performance of a 5-round method in just a single training round. Ablation studies validate the contributions of key components like noisy label modeling and pseudo label queues. Moreover, consistent improvements in pseudo labeling and the convergence of cluster counts demonstrate SSRL's effectiveness in deciphering unlabeled data. This work marks an important advancement in efficient and accurate self-supervised speaker representation learning through the novel reflective learning paradigm.
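The teacher-student mechanism behind continuous pseudo-label refinement typically maintains the teacher as an exponential moving average (EMA) of the student. A minimal PyTorch sketch with an illustrative momentum value (SSRL's exact update schedule is in the paper):

```python
# Sketch: EMA teacher update; the teacher provides stable targets for
# online pseudo labeling while the student is trained.
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```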
Authors: Kaito Ito, Taira Tsuchiya
This paper investigates the problem of controlling a linear system under possibly unbounded stochastic noise with unknown convex cost functions, known as an online control problem. In contrast to the existing work, which assumes the boundedness of noise, we show that an $ \tilde{O}(\sqrt{T}) $ high-probability regret can be achieved under unbounded noise, where $ T $ denotes the time horizon. Notably, the noise is only required to have a finite fourth moment. Moreover, when the costs are strongly convex and the noise is sub-Gaussian, we establish an $ O({\rm poly} (\log T)) $ regret bound.
Authors: Ming Li, Zhiyong Sun, Patrick J. W. Koelewijn, Siep Weiland
Sontag's universal formula is a widely used technique for stabilizing control through control Lyapunov functions. Recently, it has been extended to address safety-critical control by incorporating control barrier functions (CBFs). However, deriving a universal formula that satisfies requirements on essential properties, including safety, smoothness, and robustness against input disturbances, is still an open problem. To address this challenge, this paper introduces a novel solution, a tunable universal formula, obtained by incorporating a (state-dependent) tunable term into Sontag's formula. This tunable term enables the regulation of safety-critical control performance, allowing the attainment of desired properties through a proper selection of tunable terms. Generally, the tunable universal formula can be seen as a controller that improves quadratic program (QP)-synthesized controllers in terms of robustness and smoothness, while also reducing the conservatism (corresponding to robustness) in Sontag's formula. Furthermore, we extend the tunable universal formula to address safety-critical control problems with norm-bounded input constraints, showcasing its applicability across diverse control scenarios. Finally, we demonstrate the efficacy of our method through a two-link manipulator safe-tracking example, investigating the essential properties including safety, smoothness, and robustness against input disturbances under various tunable terms.
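For context, Sontag's classical formula, the object the tunable term is added to, reads $u(x) = -\frac{a + \sqrt{a^2 + \|b\|^4}}{\|b\|^2}\, b$ with $a = L_f V(x)$ and $b = L_g V(x)$. The sketch below implements this formula; the placement of the illustrative kappa term is an assumption for demonstration, not the paper's tunable-term construction (kappa = 0 recovers Sontag's original formula).

```python
# Sketch: Sontag's universal formula with an illustrative tunable knob.
import numpy as np

def sontag_tunable(a, b, kappa=0.0):
    """a = LfV(x) (scalar), b = LgV(x) (vector). Returns control u."""
    bn2 = float(b @ b)
    if bn2 < 1e-12:
        return np.zeros_like(b)                     # small-control property
    gain = (a + np.sqrt(a**2 + (1.0 + kappa) * bn2**2)) / bn2
    return -gain * b
```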
Authors: MHD Anas Alsakkal, Runze Wang, Jayawan Wijekoon, Huajin Tang
Spike-based encoders represent information as sequences of spikes or pulses, which are transmitted between neurons. A prevailing consensus suggests that spike-based approaches demonstrate exceptional capabilities in capturing the temporal dynamics of neural activity and have the potential to provide energy-efficient solutions for low-power applications. The Spiketrum encoder efficiently compresses input data using spike trains or code sets (for non-spiking applications) and is adaptable to both hardware and software implementations, with lossless signal reconstruction capability. This paper proposes and assesses Spiketrum's hardware implementation, evaluating its output under varying spike rates, its classification performance with popular spiking and non-spiking classifiers, the quality of its information compression, and its hardware resource utilization. Both the hardware and its software counterpart are benchmarked extensively against state-of-the-art, biologically plausible encoders. The benchmarking criteria include classification accuracy, training speed, and sparsity when encoder outputs are used for pattern recognition and classification, as well as encoded output entropy and the hardware version's resource utilization and power consumption. Results demonstrate Spiketrum's superiority in most benchmarking criteria, making it a promising choice for various applications. It efficiently utilizes hardware resources with low power consumption, achieving high classification accuracy. This work also emphasizes the potential of encoders in spike-based processing to improve the efficiency and performance of neural computing systems.
Authors: Anis Hamadouche, Mathini Sellathurai
This paper introduces a novel framework for tracking and predicting Channel State Information (CSI) by leveraging Physics-Informed Autoencoders (PIAE) integrated with a learned Koopman operator. The proposed approach models CSI as a nonlinear dynamical system governed by both intrinsic channel behavior and exogenous contextual factors such as position, temperature, and atmospheric conditions. The architecture comprises dual autoencoders, one dedicated to CSI and another to contextual inputs, linked via a shared latent state space, within which the Koopman operator captures the linear temporal evolution of CSI dynamics. This coupling enables accurate, data-driven forecasting of CSI trajectories while maintaining interpretability through a structured, physics-consistent representation. The framework supports real-time updates to the Channel Knowledge Map (CKM), enhancing the adaptability and reliability of communication systems in complex and time-varying environments. By unifying Koopman theory with learned latent representations, the proposed method provides a scalable and privacy-preserving solution for next-generation wireless networks. Empirical results demonstrate its effectiveness in delivering high-fidelity CSI predictions under diverse channel conditions.
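The dual-autoencoder, shared-latent structure can be sketched compactly: encode CSI and context into a common latent state, advance it with a learned linear (Koopman) operator, and decode the prediction. Module shapes, fusion by addition, and names below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch (PyTorch): shared-latent Koopman step for CSI prediction.
import torch.nn as nn

class KoopmanCSI(nn.Module):
    def __init__(self, csi_dim, ctx_dim, latent_dim=32):
        super().__init__()
        self.enc_csi = nn.Sequential(nn.Linear(csi_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.enc_ctx = nn.Sequential(nn.Linear(ctx_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)  # Koopman operator
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, csi_dim))

    def forward(self, csi_t, ctx_t):
        z = self.enc_csi(csi_t) + self.enc_ctx(ctx_t)  # shared latent state
        return self.dec(self.K(z))                     # predicted CSI at t+1
```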
Authors: Sunbochen Tang, Haoyuan Sun, Navid Azizan
Adaptive control achieves concurrent parameter learning and stable control under uncertainties that are linearly parameterized with known nonlinear features. Nonetheless, it is often difficult to obtain such nonlinear features. To address this difficulty, recent progress has been made in integrating meta-learning with adaptive control to learn such nonlinear features from data. However, these meta-learning-based control methods rely on classical adaptation laws using gradient descent, which is confined to the Euclidean geometry. In this paper, we propose a novel method that combines meta-learning and adaptation laws based on mirror descent, a popular generalization of gradient descent, which takes advantage of the potentially non-Euclidean geometry of the parameter space. In our approach, meta-learning not only learns the nonlinear features but also searches for a suitable mirror-descent potential function that optimizes control performance. Through numerical simulations, we demonstrate the effectiveness of the proposed method in learning efficient representations and real-time tracking control performance under uncertain dynamics.
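Mirror descent runs the gradient update in a dual space defined by a potential function. A minimal sketch with a fixed negative-entropy potential, which yields the exponentiated-gradient update on the probability simplex; the proposed method meta-learns the potential rather than fixing it as here.

```python
# Sketch: mirror descent with a negative-entropy potential
# (exponentiated gradient); theta must lie on the simplex.
import numpy as np

def mirror_descent_step(theta, grad, lr=0.1):
    w = theta * np.exp(-lr * grad)   # dual-space step mapped back
    return w / w.sum()               # normalize back onto the simplex
```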
Authors: Hsing-Hang Chou, Yun-Shao Lin, Ching-Chin Sung, Yu Tsao, Chi-Chun Lee
The human voice conveys not just words but also emotional states and individuality. Emotional voice conversion (EVC) modifies emotional expressions while preserving linguistic content and speaker identity, improving applications like human-machine interaction. While deep learning has advanced EVC models for specific target speakers on well-crafted emotional datasets, existing methods often face issues with emotion accuracy and speech distortion. In addition, the zero-shot scenario, in which emotion conversion is applied to unseen speakers, remains underexplored. This work introduces a novel diffusion framework with disentangled mechanisms and expressive guidance, trained on a large emotional speech dataset and evaluated on unseen speakers across in-domain and out-of-domain datasets. Experimental results show that our method produces expressive speech with high emotional accuracy, naturalness, and quality, showcasing its potential for broader EVC applications.
Authors: Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe
Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as noisy-TTS training has emerged, aiming to utilize in-the-wild data. However, the lack of dedicated datasets has been a significant limitation. We introduce the TTS In the Wild (TITW) dataset, which is publicly available, created through a fully automated pipeline applied to the VoxCeleb1 dataset. It comprises two training sets: TITW-Hard, derived from the transcription, segmentation, and selection of raw VoxCeleb1 data, and TITW-Easy, which incorporates additional enhancement and data selection based on DNSMOS. State-of-the-art TTS models achieve UTMOS scores above 3.0 when trained with TITW-Easy, while TITW-Hard remains difficult, with UTMOS below 2.8.
Authors: Henrik Åkesson, Diana Pamela Moya Osorio
Integrated Sensing and Communication (ISAC) systems are prone to privacy violations, since they handle sensitive, identifiable information in several applications. This paper raises the necessity of implementing privacy-preservation measures in the design of cell-free massive multiple-input multiple-output ISAC systems. To that end, given an adversary model, we propose an iterative framework of two blocks: precoder design and access point selection. The precoder design aims at maximizing the signal-to-interference-plus-noise ratio at the sensing receivers given communication constraints. The access point selection aims at minimizing the mutual information between the received signal at the users and the sensing signal, by rearranging the access points that transmit ISAC signals and the sensing receivers. Results show that this method reduces the probability of detection by the adversary.
Authors: Nicolas Chatzikiriakos, Robin Strässer, Frank Allgöwer, Andrea Iannelli
In this paper we propose an end-to-end algorithm for indirect data-driven control for bilinear systems with stability guarantees. We consider the case where the collected i.i.d. data is affected by probabilistic noise with possibly unbounded support and leverage tools from statistical learning theory to derive finite sample identification error bounds. To this end, we solve the bilinear identification problem by solving a set of linear and affine identification problems, by a particular choice of a control input during the data collection phase. We provide a priori as well as data-dependent finite sample identification error bounds on the individual matrices as well as ellipsoidal bounds, both of which are structurally suitable for control. Further, we integrate the structure of the derived identification error bounds in a robust controller design to obtain an exponentially stable closed-loop. By means of an extensive numerical study we showcase the interplay between the controller design and the derived identification error bounds. Moreover, we note appealing connections of our results to indirect data-driven control of general nonlinear systems through Koopman operator theory and discuss how our results may be applied in this setup.
Authors: Chao Huang, Hao Zhang, Zhuping Wang
Recently, a system identification method based on center manifold theory was proposed to identify polynomial nonlinear systems with uncontrollable linearization. This note presents a numerical example to show the effectiveness of this method.
Authors: Zexin Sun, John Baillieul
There is increasing interest in developing the theoretical foundations of networked control systems that illuminate how brain networks function so as to enable sensory perception, control of movement, memory, and all the operations that are needed for animals to survive. The present paper proposes a biologically inspired network model featuring dynamic connections regulated by Hebbian learning. Drawing on the machinery of graph theory and classical control, we show that our novel nonlinear model exhibits such biologically plausible features as bounded evolution, stability, resilience, and a kind of structural stability, meaning that perturbations of the model parameters leave the essential properties of the model intact. The proposed network model involves generalized cactus graphs with multiple control input nodes, and it is shown that the properties of the network are resilient to various changes in network topology provided these changes preserve the generalized cactus structure. A particular example described in what follows is an idealized network model of the visual system of a macaque monkey. The model displays resilience to network disruptions such as might occur in a living organism due to disease or injury. A different model of the same type provides an example of a system that can perform data classification.
Authors: Enrico Marco Zucchelli, Erwin Mooij
Aerocapture leverages atmospheric drag to convert a spacecraft's hyperbolic trajectory into a bound orbit. For some aerocapture missions, heating due to the radiation of high-temperature gases in the shock layer can be much larger than the heat due to convection. This paper provides analytical proof and numerical validation that radiative heat load is minimized by the same trajectory that minimizes the final $\Delta V$: a single-switch bang-bang trajectory, starting with lift up. The proof is very general and is valid for several formulations of radiative heat flux; further, the same proof can be used to conclude that convective heat load, computed according to many of the available formulations, is instead maximized by that trajectory. Further, a novel guidance that plans a bang-bang trajectory with constraints on the attitude kinematics is introduced. While achieving performance similar to that of the current state-of-the-art, the inclusion of constraints on attitude kinematics allows for much less tuning. Finally, a lateral guidance that makes use of information on the final inclination of the predicted trajectory is introduced. This guidance achieves very high accuracy on the inclination requirement with only two reversals, requiring a single parameter to be tuned.
Authors: Luke Bhan, Peijia Qin, Miroslav Krstic, Yuanyuan Shi
Predictor feedback designs are critical for delay-compensating controllers in nonlinear systems. However, these designs are limited in practical applications as predictors cannot be directly implemented, but require numerical approximation schemes, which become computationally prohibitive when system dynamics are expensive to compute. To address this challenge, we recast the predictor design as an operator learning problem, and learn the predictor mapping via a neural operator. We prove the existence of an arbitrarily accurate neural operator approximation of the predictor operator. Under the approximated predictor, we achieve semiglobal practical stability of the closed-loop nonlinear delay system. The estimate is semiglobal in a unique sense: one can enlarge the set of initial states as desired, though this increases the difficulty of training a neural operator, which appears in the stability estimate. Furthermore, our analysis holds for any black-box predictor satisfying the universal approximation error bound. We demonstrate the approach by controlling a 5-link robotic manipulator with different neural operator models, achieving significant speedups compared to classic predictor feedback schemes while maintaining closed-loop stability.
Authors: Liam Chalcroft, Jenny Crinion, Cathy J. Price, John Ashburner
Segmenting stroke lesions in MRI is challenging due to diverse acquisition protocols that limit model generalisability. In this work, we introduce two physics-constrained approaches to generate synthetic quantitative MRI (qMRI) images that improve segmentation robustness across heterogeneous domains. Our first method, $\texttt{qATLAS}$, trains a neural network to estimate qMRI maps from standard MPRAGE images, enabling the simulation of varied MRI sequences with realistic tissue contrasts. The second method, $\texttt{qSynth}$, synthesises qMRI maps directly from tissue labels using label-conditioned Gaussian mixture models, ensuring physical plausibility. Extensive experiments on multiple out-of-domain datasets show that both methods outperform a baseline UNet, with $\texttt{qSynth}$ notably surpassing previous synthetic data approaches. These results highlight the promise of integrating MRI physics into synthetic data generation for robust, generalisable stroke lesion segmentation. Code is available at this https URL
Authors: Omar H. Khater, Basem Almadani, Farouq Aliyu
Internet of Things (IoT) based healthcare systems offer significant potential for improving the delivery of healthcare services in humanitarian engineering, providing essential healthcare services to millions of underserved people in remote areas worldwide. However, these areas have poor network infrastructure, making communications difficult for traditional IoT. This paper presents a real-time chest X-ray classification system for hospitals in remote areas using FastDDS real-time middleware, offering reliable real-time communication. We fine-tuned a ResNet50 neural network to an accuracy of 88.61%, a precision of 88.76%, and a recall of 88.49%. Our system achieves an average throughput of 3.2 KB/s and an average latency of 65 ms. The proposed system demonstrates how middleware-based systems can assist doctors in remote locations.
Authors: Maryann Rui, Munther Dahleh
We study the problem of learning mixtures of linear dynamical systems (MLDS) from input-output data. The mixture setting allows us to leverage observations from related dynamical systems to improve the estimation of individual models. Building on spectral methods for mixtures of linear regressions, we propose a moment-based estimator that uses tensor decomposition to estimate the impulse response parameters of the mixture models. The estimator improves upon existing tensor decomposition approaches for MLDS by utilizing the entire length of the observed trajectories. We provide sample complexity bounds for estimating MLDS in the presence of noise, in terms of both the number of trajectories $N$ and the trajectory length $T$, and demonstrate the performance of the estimator through simulations.
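A minimal sketch of the underlying tensor-decomposition machinery, assuming orthogonal components and a noiseless third-order moment tensor (the paper's estimator handles the general noisy MLDS case): tensor power iteration with deflation recovers the rank-1 components, which stand in for per-cluster impulse-response parameters.
```python
# Hedged sketch of moment-based recovery: power iteration with deflation on a
# symmetric third-order tensor with orthogonal components. Toy dimensions only.
import numpy as np

rng = np.random.default_rng(0)
d, K = 6, 2
Acomp = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :K]  # orthonormal components
w = np.array([0.6, 0.4])                                     # mixture weights
T = sum(w[k] * np.einsum('i,j,k->ijk', Acomp[:, k], Acomp[:, k], Acomp[:, k])
        for k in range(K))

for k in range(K):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(100):                         # tensor power iteration
        v = np.einsum('ijk,j,k->i', T, v, v)
        lam = np.linalg.norm(v)
        v /= lam
    print(f"component {k}: weight ~ {lam:.3f}, "
          f"alignment = {abs(Acomp.T @ v).max():.3f}")
    T -= lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate the found component
```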
Authors: Runci Bai, Guibao Xu, Yanze Shi
Brain tumors can lead to neurological dysfunction, cognitive and psychological changes, increased intracranial pressure, and seizures, posing significant risks to health. The You Only Look Once (YOLO) series has shown superior accuracy in medical imaging object detection. This paper presents a novel SCC-YOLO architecture that integrates the SCConv module into YOLOv9. The SCConv module optimizes convolutional efficiency by reducing spatial and channel redundancy, enhancing image feature learning. We examine the effects of different attention mechanisms with YOLOv9 for brain tumor detection using the Br35H dataset and our custom dataset (Brain_Tumor_Dataset). Results indicate that SCC-YOLO improved mAP50 by 0.3% on the Br35H dataset and by 0.5% on our custom dataset compared to YOLOv9. SCC-YOLO achieves state-of-the-art performance in brain tumor detection.
Authors: I-Hsiang Chen, Wei-Ting Chen, Yu-Wei Liu, Yuan-Chun Chiang, Sy-Yen Kuo, Ming-Hsuan Yang
Image restoration aims to recover content from inputs degraded by various factors, such as adverse weather, blur, and noise. Perceptual Image Restoration (PIR) methods improve visual quality but often do not support downstream tasks effectively. On the other hand, Task-oriented Image Restoration (TIR) methods focus on enhancing image utility for high-level vision tasks, sometimes compromising visual quality. This paper introduces UniRestore, a unified image restoration model that bridges the gap between PIR and TIR by using a diffusion prior. The diffusion prior is designed to generate images that align with human visual quality preferences, but these images are often unsuitable for TIR scenarios. To address this limitation, UniRestore utilizes encoder features from an autoencoder to adapt the diffusion prior to specific tasks. We propose a Complementary Feature Restoration Module (CFRM) to reconstruct degraded encoder features and a Task Feature Adapter (TFA) module to facilitate adaptive feature fusion in the decoder. This design allows UniRestore to optimize images for both human perception and downstream task requirements, addressing discrepancies between visual quality and functional needs. Integrating these modules also enhances UniRestore's adaptability and efficiency across diverse tasks. Extensive experiments demonstrate the superior performance of UniRestore in both PIR and TIR scenarios.
Authors: Rajath Rao, Adithya Ganesan, Oscar Kjell, Jonah Luby, Akshay Raghavan, Scott Feltman, Whitney Ringwald, Ryan L. Boyd, Benjamin Luft, Camilo Ruggero, Neville Ryant, Roman Kotov, H. Andrew Schwartz
Current speech encoding pipelines often rely on an additional text-based LM to get robust representations of human communication, even though SotA speech-to-text models often have an LM within. This work proposes an approach to improve the LM within an audio model such that the subsequent text-LM is unnecessary. We introduce WhiSPA (Whisper with Semantic and Psychological Alignment), which leverages a novel audio training objective: contrastive loss with a language model embedding as a teacher. Using over 500k speech segments from mental health audio interviews, we evaluate the utility of aligning Whisper's latent space with semantic representations from a text autoencoder (SBERT) and lexically derived embeddings of basic psychological dimensions: emotion and personality. Over self-supervised affective tasks and downstream psychological tasks, WhiSPA surpasses current speech encoders, achieving an average error reduction of 73.4% and 83.8%, respectively. WhiSPA demonstrates that it is not always necessary to run a subsequent text LM on speech-to-text output in order to get a rich psychological representation of human communication.
Authors: MHD Anas Alsakkal, Runze Wang, Piotr Dudek, Jayawan Wijekoon
Spiking Neural Networks (SNNs) offer a biologically inspired computational paradigm, enabling energy-efficient data processing through spike-based information transmission. Despite notable advancements in hardware for SNNs, spike encoding has largely remained software-dependent, limiting efficiency. This paper addresses the need for adaptable and resource-efficient spike encoding hardware by presenting an area-optimized hardware implementation of the Spiketrum algorithm, which encodes time-varying analogue signals into spatiotemporal spike patterns. Unlike earlier performance-optimized designs, which prioritize speed, our approach focuses on reducing hardware footprint, achieving a 52% reduction in Block RAMs (BRAMs), 31% fewer Digital Signal Processing (DSP) slices, and a 6% decrease in Look-Up Tables (LUTs). The proposed implementation has been verified on an FPGA and successfully integrated into an IC using TSMC180 technology. Experimental results demonstrate the system's effectiveness in real-world applications, including sound and ECG classification. This work highlights the trade-offs between performance and resource efficiency, offering a flexible, scalable solution for neuromorphic systems in power-sensitive applications like cochlear implants and neural devices.
Authors: Titus Griebel, Anwai Archit, Constantin Pape
Nucleus segmentation is an important analysis task in digital pathology. However, methods for automatic segmentation often struggle with new data from a different distribution, requiring users to manually annotate nuclei and retrain data-specific models. Vision foundation models (VFMs), such as the Segment Anything Model (SAM), offer a more robust alternative for automatic and interactive segmentation. Despite their success in natural images, a foundation model for nucleus segmentation in histopathology is still missing. Initial efforts to adapt SAM have shown some success, but have not yet introduced a comprehensive model for diverse segmentation tasks. To close this gap, we introduce PathoSAM, a VFM for nucleus segmentation, based on training SAM on a diverse dataset. Our extensive experiments show that it is the new state-of-the-art model for automatic and interactive nucleus instance segmentation in histopathology. We also demonstrate how it can be adapted for other segmentation tasks, including semantic nucleus segmentation. For this task, we show that it yields results better than popular methods, while not yet beating the state-of-the-art, CellViT. Our models are open-source and compatible with popular tools for data annotation. We also provide scripts for whole-slide image segmentation. Our code and models are publicly available at this https URL.
Authors: Xianghui Ze, Zhenbo Song, Qiwei Wang, Jianfeng Lu, Yujiao Shi
Generating street-view images from satellite imagery is a challenging task, particularly in maintaining accurate pose alignment and incorporating diverse environmental conditions. While diffusion models have shown promise in generative tasks, their ability to maintain strict pose alignment throughout the diffusion process is limited. In this paper, we propose a novel Iterative Homography Adjustment (IHA) scheme applied during the denoising process, which effectively addresses pose misalignment and ensures spatial consistency in the generated street-view images. Additionally, currently available datasets for satellite-to-street-view generation are limited in their diversity of illumination and weather conditions, thereby restricting the generalizability of the generated outputs. To mitigate this, we introduce a text-guided illumination and weather-controlled sampling strategy that enables fine-grained control over the environmental factors. Extensive quantitative and qualitative evaluations demonstrate that our approach significantly improves pose accuracy and enhances the diversity and realism of generated street-view images, setting a new benchmark for satellite-to-street-view generation tasks.
Authors: Jagabandhu Mishra, Manasi Chhibber, Hye-jin Shim, Tomi H. Kinnunen
We propose an explainable probabilistic framework for characterizing spoofed speech by decomposing it into probabilistic attribute embeddings. Unlike raw high-dimensional countermeasure embeddings, which lack interpretability, the proposed probabilistic attribute embeddings aim to detect specific speech synthesizer components, represented through high-level attributes and their corresponding values. We use these probabilistic embeddings with four classifier back-ends to address two downstream tasks: spoofing detection and spoofing attack attribution. The former is the well-known bonafide-spoof detection task, whereas the latter seeks to identify the source method (generator) of a spoofed utterance. We additionally use Shapley values, a widely used technique in machine learning, to quantify the relative contribution of each attribute value to the decision-making process in each task. Results on the ASVspoof2019 dataset demonstrate the substantial role of duration and conversion modeling in spoofing detection; and waveform generation and speaker modeling in spoofing attack attribution. In the detection task, the probabilistic attribute embeddings achieve $99.7\%$ balanced accuracy and $0.22\%$ equal error rate (EER), closely matching the performance of raw embeddings ($99.9\%$ balanced accuracy and $0.22\%$ EER). Similarly, in the attribution task, our embeddings achieve $90.23\%$ balanced accuracy and $2.07\%$ EER, compared to $90.16\%$ and $2.11\%$ with raw embeddings. These results demonstrate that the proposed framework is both inherently explainable by design and capable of achieving performance comparable to raw CM embeddings.
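For readers unfamiliar with the attribution step, here is a self-contained toy computation of exact Shapley values by coalition enumeration; the attribute names and additive value function are hypothetical stand-ins for the paper's classifier back-ends.
```python
# Hedged illustration: exact Shapley values for a 3-"attribute" toy game.
# For an additive value function, each Shapley value equals its own score.
from itertools import combinations
from math import factorial

players = ["duration", "conversion", "waveform"]   # hypothetical attribute names

def value(coalition):
    # Toy model score for a subset of attributes (assumed, not the paper's).
    scores = {"duration": 0.4, "conversion": 0.3, "waveform": 0.1}
    return sum(scores[p] for p in coalition)

n = len(players)
shapley = {}
for p in players:
    others = [q for q in players if q != p]
    total = 0.0
    for k in range(n):
        for S in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (value(S + (p,)) - value(S))  # marginal contribution
    shapley[p] = total

print(shapley)
```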
Authors: Yuchen Zhang, Pinjun Zheng, Jie Ma, Henk Wymeersch, Tareq Y. Al-Naffouri
We investigate a multi-low Earth orbit (LEO) satellite system that simultaneously provides positioning and communication services to terrestrial user terminals. To address the challenges of channel estimation in LEO satellite systems, we propose a novel two-timescale positioning-aided channel estimation framework, exploiting the distinct variation rates of position-related parameters and channel gains inherent in LEO satellite channels. Using the misspecified Cramér-Rao bound (MCRB) theory, we systematically analyze positioning performance under practical imperfections, such as inter-satellite clock bias and carrier frequency offset. Furthermore, we theoretically demonstrate how position information derived from downlink positioning can enhance uplink channel estimation accuracy, even in the presence of positioning errors, through an MCRB-based analysis. To overcome the constraints of limited link budgets and communication rates associated with single-satellite-based communication, we develop a multi-LEO satellite cooperative beamforming strategy for downlink communication, capitalizing on the benefit of cluster-wise satellite cooperation. Theoretical analyses and numerical results confirm the effectiveness of the proposed framework in achieving high-precision downlink positioning under practical imperfections, facilitating uplink channel estimation, and enabling efficient downlink communication.
Authors: Feng Guo, Luis D. Couto
This study evaluates numerical discretization methods for the Single Particle Model (SPM) used in electrochemical modeling. The methods include the Finite Difference Method (FDM), spectral methods, Padé approximation, and parabolic approximation. Evaluation criteria are accuracy, execution time, and memory usage, aiming to guide method selection for electrochemical models. Under constant current conditions, the FDM explicit Euler and Runge-Kutta methods show significant errors, while the FDM implicit Euler method improves accuracy with more nodes. The spectral method achieves the best accuracy and convergence with as few as five nodes. The Padé approximation exhibits increasing errors with higher current, and the parabolic approximation shows higher errors than the converged spectral and FDM implicit Euler methods. Under dynamic conditions, frequency domain analysis indicates that the FDM, spectral, and Padé approximation methods improve high-frequency response by increasing node count or method order. In terms of execution time, the parabolic method is fastest, followed by the Padé approximation. The spectral method is faster than FDM, while the FDM implicit Euler method is the slowest. Memory usage is lowest for the parabolic and Padé methods, moderate for FDM, and highest for the spectral method. These findings provide practical guidance for selecting discretization methods under different operating scenarios.
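A minimal sketch of one of the compared schemes, FDM with implicit Euler time stepping for spherical solid-phase diffusion; the grid size, diffusivity, and flux values are illustrative, and the boundary treatment (symmetry at the center, ghost-node flux at the surface) follows standard practice rather than the paper's exact formulation.
```python
# Hedged sketch: implicit-Euler FDM for dc/dt = D*(c_rr + (2/r)*c_r) in a
# sphere, with symmetry at r = 0 and prescribed surface gradient j/D (j > 0
# fills the particle). All parameter values are assumptions, not the paper's.
import numpy as np

N, R, D, j, dt = 50, 1e-5, 1e-14, 1e-6, 1.0    # nodes, radius, diffusivity, flux, step
dr = R / (N - 1)
r = np.linspace(0.0, R, N)

A = np.zeros((N, N))
for i in range(1, N - 1):                       # interior nodes: central differences
    A[i, i - 1] = D * (1.0 / dr**2 - 1.0 / (r[i] * dr))
    A[i, i] = -2.0 * D / dr**2
    A[i, i + 1] = D * (1.0 / dr**2 + 1.0 / (r[i] * dr))
A[0, 0], A[0, 1] = -6.0 * D / dr**2, 6.0 * D / dr**2       # symmetry at r = 0
A[-1, -2], A[-1, -1] = 2.0 * D / dr**2, -2.0 * D / dr**2   # ghost node at r = R

b = np.zeros(N)
b[-1] = (2.0 * j / dr) * (1.0 + dr / r[-1])    # surface-flux source term

c = np.full(N, 1000.0)                          # initial concentration
M = np.eye(N) - dt * A                          # implicit Euler: (I - dt*A) c_new = c + dt*b
for _ in range(100):
    c = np.linalg.solve(M, c + dt * b)
print("surface concentration after 100 steps:", c[-1])
```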
Authors: Max van Haren, Lennart Blanken, Tom Oomen
Frequency-domain performance analysis of intersample behavior in sampled-data and multirate systems is challenging due to the lack of a frequency-separation principle, and systematic identification techniques are lacking. The aim of this manuscript is to develop an efficient technique for identifying the full intersample performance in the frequency-domain for closed-loop multirate systems, in particular the Performance Frequency Gain (PFG). Through local modeling techniques, aliased frequency components are effectively disentangled when identifying the PFG, which is directly facilitated by frequency-lifting the multirate system to a multivariable time-invariant representation. The developed method accurately and directly identifies the PFG in a single identification experiment. Finally, the developed method is experimentally validated on a prototype motion system, showing accurate identification of frequency-domain representations for the multirate system, including the PFG.
Authors: Alvin Combrink, Sabino Francesco Roselli, Martin Fabian
Multi-agent Path Finding (MAPF) is the problem of planning collision-free movements of agents so that they get from where they are to where they need to be. Commonly, agents are located on a graph and can traverse edges. This problem has many variations and has been studied for decades. Two such variations are the continuous-time and the lifelong MAPF problems. In the former, edges have non-unit lengths and volumetric agents can traverse them at any real-valued time. In the latter, agents must attend to a continuous stream of incoming tasks. Much work has been devoted to designing solution methods within these two areas. To our knowledge, however, the combined problem of continuous-time lifelong MAPF has yet to be addressed. This work addresses continuous-time lifelong MAPF with volumetric agents by presenting the fast and sub-optimal Continuous-time Prioritized Lifelong Planner (CPLP). CPLP continuously assigns agents to tasks and computes plans using a combination of two path planners; one based on CCBS and the other based on SIPP. Experimental results with up to 800 agents on graphs with up to 12,000 vertices demonstrate practical performance, where maximum planning times fall within the available time budget. Additionally, CPLP ensures collision-free movement even when failing to meet this budget. Therefore, the robustness of CPLP highlights its potential for real-world applications.
Authors: Max Langtry, Ruchi Choudhary
Energy storage is needed to match renewable generation to industrial loads in energy parks. However, the future performance of bulk storage technologies is currently highly uncertain. Due to the urgency of decarbonization targets, energy park projects must be designed and begun now. But as uncertainty in storage performance is reduced, a different technology than the one identified during initial design may turn out to be cheaper. Enabling flexibility so that design adaptations can be made as better information becomes available would lower the cost of decarbonizing industry. But having this flexibility is itself costly. This raises the question, "Is it worth it?" This study quantifies the benefit of retaining flexibility to adapt energy park designs and optionality over storage technology choice as uncertainty reduces, to determine whether it is economically worthwhile. It applies the Value of Information analysis framework to the sizing of wind, solar, and storage in an illustrative energy park model based on a real-world proposal near Rotterdam, considering uncertainty in storage efficiency, lifetime, and capital cost. Updating asset sizings after storage uncertainty has been reduced is found to lower total costs by 18% on average. Having the option to switch storage technology choice as well reduces costs by a further 13%, which is substantially greater than the cost of providing storage optionality. Using two storage technologies in the energy park reduces costs by 14%, and in this case storage optionality is not worthwhile. These results are robust to the level of uncertainty reduction in storage performance, and the risk aversion of the system designer.
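The Value of Information logic can be illustrated in a few lines: compare the expected cost of committing to one design now against the expected cost of choosing after the uncertainty resolves. The scenario probabilities and cost models below are invented for illustration, not taken from the Rotterdam case study.
```python
# Toy Value-of-Information computation: flexibility is worth the gap between
# the best fixed design and the scenario-wise best design. Numbers assumed.
import numpy as np

scenarios = np.array([0.70, 0.85, 0.95])   # possible storage efficiencies
probs = np.array([0.3, 0.4, 0.3])          # prior beliefs over scenarios

def cost(design, eff):
    # Toy total-cost models for two hypothetical technology sizings.
    return 100 - 60 * eff if design == 0 else 70 - 20 * eff

designs = [0, 1]
# Commit now: the single design minimizing expected cost over scenarios.
fixed = min(np.dot(probs, [cost(d, e) for e in scenarios]) for d in designs)
# Adapt later: pick the best design once the true efficiency is known.
adaptive = np.dot(probs, [min(cost(d, e) for d in designs) for e in scenarios])
print(f"fixed={fixed:.2f}, adaptive={adaptive:.2f}, "
      f"value of flexibility={fixed - adaptive:.2f}")
```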
Authors: Quentin Rommel, Michael Hibbard, Pavan Shukla, Himanshu Save, Srinivas Bettadpur, Ufuk Topcu
As space missions become more complex, planning methods must maximize mission performance while rigorously enforcing safety. We develop a probabilistic approach based on a finite-horizon Markov decision process to optimize spacecraft operations planning with safety guarantees. In the model, states capture essential mission parameters, and actions represent the operational adjustments needed to meet mission objectives. By directly incorporating uncertainties from environmental conditions and spacecraft dynamics, an optimal sequence of actions is computed that maximizes expected rewards and strictly enforces safety constraints. Numerical experiments on the GRACE-FO mission demonstrate robust performance under uncertainties while providing probabilistic safety guarantees, offering a reliable solution for autonomous spacecraft operations.
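A bare-bones sketch of the finite-horizon backward induction this kind of planner builds on, with a soft penalty on an unsafe state standing in for the paper's strict probabilistic safety guarantees; all transition probabilities and rewards are random placeholders.
```python
# Hedged sketch: backward induction over a toy finite-horizon MDP. The soft
# penalty is an assumed stand-in for the paper's hard safety constraints.
import numpy as np

S, A, H = 5, 2, 10                           # states, actions, horizon (toy sizes)
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
R = rng.random((S, A))                       # stage rewards
penalty = np.zeros(S)
penalty[4] = -50.0                           # discourage occupying the unsafe state

V = penalty.copy()                           # terminal values
policy = np.zeros((H, S), dtype=int)
for t in reversed(range(H)):                 # backward induction over time
    Q = R + P @ V                            # Q[s, a] = r(s, a) + E[V(s')]
    policy[t] = np.argmax(Q, axis=1)
    V = Q.max(axis=1) + penalty              # add safety penalty at each stage

print("t=0 values:", np.round(V, 2))
```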
Authors: Kohei Saijo, Tetsuji Ogawa
In this study, we investigate the impact of positional encoding (PE) on source separation performance and the generalization ability to long sequences (length extrapolation) in Transformer-based time-frequency (TF) domain dual-path models. The length extrapolation capability in TF-domain dual-path models is a crucial factor, as it affects not only their performance on long-duration inputs but also their generalizability to signals with unseen sampling rates. While PE is known to significantly impact length extrapolation, there has been limited research that explores the choice of PEs for TF-domain dual-path models from this perspective. To address this gap, we compare various PE methods using a recent state-of-the-art model, TF-Locoformer, as the base architecture. Our analysis yields the following key findings: (i) When handling sequences that are the same length as or shorter than those seen during training, models with PEs achieve better performance. (ii) However, models without PE exhibit superior length extrapolation. This trend is particularly pronounced when the model contains convolutional layers.
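As a reference point for the comparison, a minimal implementation of one widely used PE variant, sinusoidal absolute encoding; the study compares several PE methods, and this is not necessarily the best-performing one.
```python
# Sinusoidal absolute positional encoding, evaluated at a training-length and
# a longer, extrapolation-regime sequence. Dimensions here are illustrative.
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe

pe_train = sinusoidal_pe(200, 64)            # length seen during training
pe_long = sinusoidal_pe(800, 64)             # unseen positions at inference
print(pe_train.shape, pe_long.shape)
```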
Authors: Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
We propose Omni-R1, which fine-tunes a recent multi-modal LLM, Qwen2.5-Omni, on an audio question answering dataset with the reinforcement learning method GRPO. This leads to new state-of-the-art performance on the recent MMAU and MMAR benchmarks. Omni-R1 achieves the highest accuracies on the sounds, music, speech, and overall average categories, both on the Test-mini and Test-full splits. To understand the performance improvement, we tested models both with and without audio and found that much of the performance improvement from GRPO could be attributed to better text-based reasoning. We also made a surprising discovery that fine-tuning without audio on a text-only dataset was effective at improving the audio-based performance.
Authors: Zengrui Han, Lu Bai, Ziwei Huang, Xiang Cheng
Guided by Synesthesia of Machines (SoM), the nonlinear mapping relationship between sensory and communication information serves as a powerful tool to enhance both the accuracy and generalization of vehicle-to-vehicle (V2V) multi-modal intelligent channel modeling (MMICM) in intelligent transportation systems (ITSs). To explore the general mapping relationship between physical environment and electromagnetic space, a new intelligent sensing-communication integration dataset, named V2V-M3, is constructed for multiple scenarios in V2V communications with multiple frequency bands and multiple vehicular traffic densities (VTDs). Leveraging the strong representation and cross-modal inference capabilities of large language models (LLMs), a novel LLM-based method for Scatterer Generation (LLM4SG) from light detection and ranging (LiDAR) point clouds is developed. To address the inherent and significant differences across multi-modal data, a synergistically optimized four-module architecture, i.e., preprocessor, embedding, backbone, and output modules, is designed by considering the sensing/channel characteristics and electromagnetic propagation mechanism. On the basis of cross-modal representation alignment and positional encoding, the network of LLM4SG is fine-tuned to capture the general mapping relationship between LiDAR point clouds and scatterers. Simulation results demonstrate that the proposed LLM4SG achieves superior performance in full-sample and generalization testing, significantly outperforming small models across different frequency bands, scenarios, and VTDs.
Authors: Peihong Zhang, Zhixin Li, Rui Sang, Yuxuan Liu, Yiqiang Cai, Yizhou Tan, Shengchen Li
Electrocardiogram (ECG) and Phonocardiogram (PCG) signals are linked by a latent coupling signal representing the electrical-to-mechanical cardiac transformation. While valuable for cardiovascular disease (CVD) detection, this coupling signal is traditionally estimated using deconvolution methods that amplify noise, limiting clinical utility. In this paper, we propose Noise-Robust Multi-Modal Coupling Signal Estimation (NMCSE), which reformulates the problem as distribution matching via optimal transport theory. By jointly optimizing amplitude and temporal alignment, NMCSE mitigates noise amplification without additional preprocessing. Integrated with our Temporal-Spatial Feature Extraction network, NMCSE enables robust multi-modal CVD detection. Experiments on the PhysioNet 2016 dataset with realistic hospital noise demonstrate that NMCSE reduces estimation errors by approximately 30% in Mean Squared Error while maintaining higher Pearson Correlation Coefficients across all tested signal-to-noise ratios. Our approach achieves 97.38% accuracy and 0.98 AUC in CVD detection, outperforming state-of-the-art methods and demonstrating robust performance for real-world clinical applications.
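The distribution-matching intuition can be illustrated in one dimension, where the Wasserstein-1 distance between empirical samples reduces to comparing sorted values and is markedly less sensitive to additive noise than pointwise error; this toy sketch is not the NMCSE algorithm itself.
```python
# Toy illustration of distribution matching: in 1-D, the W1 distance between
# empirical samples is the mean absolute difference of sorted values.
import numpy as np

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 4 * np.pi, 400))         # stand-in coupling signal
noisy = clean + 0.3 * rng.standard_normal(400)         # noisy estimate of it

w1 = np.mean(np.abs(np.sort(clean) - np.sort(noisy)))  # exact 1-D W1 distance
mse = np.mean((clean - noisy) ** 2)                    # pointwise comparison
print(f"W1 = {w1:.4f}, MSE = {mse:.4f}")               # W1 reacts less to noise
```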
Authors: Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak
Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. On the other hand, generative models for TSE lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio's latent space, aligning it with the mixture audio's latent space to prevent mismatches. Evaluated on the widely-used Libri2Mix dataset, SoloSpeech achieves the new state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks while demonstrating exceptional generalization on out-of-domain data and real-world scenarios.
Authors: Puyuan Peng, Shang-Wen Li, Abdelrahman Mohamed, David Harwath
We present VoiceStar, the first zero-shot TTS model that achieves both output duration control and extrapolation. VoiceStar is an autoregressive encoder-decoder neural codec language model, that leverages a novel Progress-Monitoring Rotary Position Embedding (PM-RoPE) and is trained with Continuation-Prompt Mixed (CPM) training. PM-RoPE enables the model to better align text and speech tokens, indicates the target duration for the generated speech, and also allows the model to generate speech waveforms much longer in duration than those seen during training. CPM training also helps to mitigate the training/inference mismatch, and significantly improves the quality of the generated speech in terms of speaker similarity and intelligibility. VoiceStar outperforms or is on par with current state-of-the-art models on short-form benchmarks such as Librispeech and Seed-TTS, and significantly outperforms these models on long-form/extrapolation benchmarks (20-50s) in terms of intelligibility and naturalness. Code and models: this https URL. Audio samples: this https URL
Authors: Lucas Ueda, João Lima, Leonardo Marques, Paula Costa
Emotion plays a fundamental role in human interaction, and therefore systems capable of identifying emotions in speech are crucial in the context of human-computer interaction. Speech emotion recognition (SER) is a challenging problem, particularly in natural speech and when the available data is imbalanced across emotions. This paper presents our proposed system in the context of the 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge. Our proposed architecture leverages cross-modality, utilizing cross-modal attention to fuse representations from different modalities. To address class imbalance, we employed two training designs: (i) weighted cross-entropy loss (WCE); and (ii) WCE with an additional neutral-expressive soft margin loss and balancing. We trained a total of 12 multimodal models, which were ensembled using a balanced stacking model. Our proposed system achieves a macro-F1 score of 0.4094 and an accuracy of 0.4128 on 8-class speech emotion recognition.
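A minimal sketch of training design (i), weighted cross-entropy with inverse-frequency class weights; the class counts below are hypothetical.
```python
# Class-weighted cross-entropy for imbalanced emotion classes. The counts are
# invented for illustration, not the challenge's actual class distribution.
import torch
import torch.nn as nn

num_classes = 8
counts = torch.tensor([5000., 300., 800., 1200., 200., 600., 900., 1000.])
weights = counts.sum() / (num_classes * counts)   # inverse-frequency weighting

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(16, num_classes, requires_grad=True)  # fake batch of logits
labels = torch.randint(0, num_classes, (16,))              # fake emotion labels
loss = criterion(logits, labels)
loss.backward()
print("weighted CE loss:", loss.item())
```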
Authors: Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang, Tim Fingscheidt, Shinji Watanabe
There has been a growing effort to develop universal speech enhancement (SE) to handle inputs with various speech distortions and recording conditions. The URGENT Challenge series aims to foster such universal SE by embracing a broad range of distortion types, increasing data diversity, and incorporating extensive evaluation metrics. This work introduces the Interspeech 2025 URGENT Challenge, the second edition of the series, to explore several aspects that have received limited attention so far: language dependency, universality for more distortion types, data scalability, and the effectiveness of using noisy training data. We received 32 submissions, where the best system uses a discriminative model, while most other competitive ones are hybrid methods. Analysis reveals some key findings: (i) some generative or hybrid approaches are preferred in subjective evaluations over the top discriminative model, and (ii) purely generative SE models can exhibit language dependency.
Authors: Yunliang Qi, Meng Lou, Yimin Liu, Lu Li, Zhen Yang, Wen Nie
Remote sensing image super-resolution (RSISR) is a crucial task in remote sensing image processing, aiming to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Despite the growing number of RSISR methods proposed in recent years, a systematic and comprehensive review of these methods is still lacking. This paper presents a thorough review of RSISR algorithms, covering methodologies, datasets, and evaluation metrics. We provide an in-depth analysis of RSISR methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, to help researchers understand current trends and challenges. Our review also discusses the strengths, limitations, and inherent challenges of these techniques. Notably, our analysis reveals significant limitations in existing methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation. Based on these findings, we outline future research directions, highlighting the need for domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.
Authors: Aditya Retnanto (1), Son Le (1), Sebastian Mueller (1), Armin Leitner (2), Michael Riffler (2), Konrad Schindler (3), Yohan Iddawela (1) ((1) Asian Development Bank, Philippines, (2) GeoVille Information Systems and Data Processing GmbH, Austria, (3) ETH Zürich, Switzerland)
Super-resolution aims to increase the resolution of satellite images by reconstructing high-frequency details, which go beyond naïve upsampling. This has particular relevance for Earth observation missions like Sentinel-2, which offer frequent, regular coverage at no cost, but at coarse resolution. Its pixel footprint is too large to capture small features like houses, streets, or hedge rows. To address this, we present SEN4X, a hybrid super-resolution architecture that combines the advantages of single-image and multi-image techniques. It combines temporal oversampling from repeated Sentinel-2 acquisitions with a learned prior from high-resolution Pléiades Neo data. In doing so, SEN4X upgrades Sentinel-2 imagery to 2.5 m ground sampling distance. We test the super-resolved images on urban land-cover classification in Hanoi, Vietnam. We find that they lead to a significant performance improvement over state-of-the-art super-resolution baselines.
Authors: Razi Mahmood, Diego Machado Reyes, Ge Wang, Mannudeep Kalra, Pingkun Yan
With advances in generative artificial intelligence (AI), it is now possible to produce realistic-looking automated reports for preliminary reads of radiology images. This can expedite clinical workflows, improve accuracy and reduce overall costs. However, it is also well-known that such models often hallucinate, leading to false findings in the generated reports. In this paper, we propose a new method of fact-checking of AI-generated reports using their associated images. Specifically, the developed examiner differentiates real and fake sentences in reports by learning the association between an image and sentences describing real or potentially fake findings. To train such an examiner, we first created a new dataset of fake reports by perturbing the findings in the original ground truth radiology reports associated with images. Text encodings of real and fake sentences drawn from these reports are then paired with image encodings to learn the mapping to real/fake labels. The utility of such an examiner is demonstrated for verifying automatically generated reports by detecting and removing fake sentences. Future generative AI approaches can use the resulting tool to validate their reports leading to a more responsible use of AI in expediting clinical workflows.
Authors: Alejandro Parada-Mayorga, Alejandro Ribeiro
In this work, we study the properties of sampling sets on families of large graphs by leveraging the theory of graphons and graph limits. To this end, we extend to graphon signals the notion of removable and uniqueness sets, which was developed originally for the analysis of signals on graphs. We state the formal definition of a $\Lambda$-removable set and conditions under which a bandlimited graphon signal can be represented in a unique way when its samples are obtained from the complement of a given $\Lambda$-removable set in the graphon. By leveraging such results we show that graphon representations of graphs and graph signals can be used as a common framework to compare sampling sets between graphs with different numbers of nodes and edges, and different node labelings. Additionally, given a sequence of graphs that converges to a graphon, we show that the sequences of sampling sets whose graphon representation is identical in $[0,1]$ are convergent as well. We exploit the convergence results to provide an algorithm that obtains approximately close to optimal sampling sets. Performing a set of numerical experiments, we evaluate the quality of these sampling sets. Our results open the door for the efficient computation of optimal sampling sets in graphs of large size.
Authors: Shuangrui Ding, Zihan Liu, Xiaoyi Dong, Pan Zhang, Rui Qian, Junhao Huang, Conghui He, Dahua Lin, Jiaqi Wang
Creating lyrics and melodies for the vocal track in a symbolic format, known as song composition, demands expert musical knowledge of melody, an advanced understanding of lyrics, and precise alignment between them. Despite achievements in sub-tasks such as lyric generation, lyric-to-melody, and melody-to-lyric generation, a unified model for song composition has not yet been achieved. In this paper, we introduce SongComposer, a pioneering step towards a unified song composition model that can readily create symbolic lyrics and melodies following instructions. SongComposer is a music-specialized large language model (LLM) that, for the first time, integrates the capability of simultaneously composing lyrics and melodies into LLMs by leveraging three key innovations: 1) a flexible tuple format for word-level alignment of lyrics and melodies, 2) an extended tokenizer vocabulary for song notes, with scalar initialization based on musical knowledge to capture rhythm, and 3) a multi-stage pipeline that captures musical structure, starting with motif-level melody patterns and progressing to phrase-level structure for improved coherence. Extensive experiments demonstrate that SongComposer outperforms advanced LLMs, including GPT-4, in tasks such as lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation. Moreover, we will release SongCompose, a large-scale dataset for training, containing paired lyrics and melodies in Chinese and English.
Authors: Jiacheng Jiang, Hongjiang Lei, Ki-Hong Park, Gaofeng Pan, Mohamed-Slim Alouini
In this work, a delay-tolerant unmanned aerial vehicle (UAV)-relayed covert and secure communication framework is investigated. In this framework, a legitimate UAV serves as an aerial relay to realize communication when the direct link between the terrestrial transmitter and receiver is blocked, and also acts as a friendly jammer to suppress the malicious nodes present on the ground. Subsequently, considering the uncertainty of malicious nodes' positions, a robust fractional programming optimization problem is built to maximize energy efficiency by jointly optimizing the trajectory of the UAV, the transmit power of the transmitter, and the time-switching factor. For the extremely complicated covert constraint, Pinsker's inequality, Jensen's inequality, and the bisection search method are employed to construct a tractable shrunken one. After this, an alternating optimization-based algorithm is proposed to solve the fractional programming optimization problem. To achieve low complexity, we design the primal-dual search-based algorithm and the successive convex approximation-based algorithm, respectively, for each sub-problem. Numerical results show the effectiveness of our proposed algorithm.
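The bisection step can be sketched generically: for a monotone constraint function, binary search finds the largest feasible value. The function g below is a toy monotone proxy, not the paper's covertness metric.
```python
# Generic bisection over a monotone feasibility boundary: return the largest
# x in [lo, hi] with g(x) <= eps. The g used here is an assumed toy function.
import math

def bisect_threshold(g, eps, lo, hi, tol=1e-8):
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) <= eps:
            lo = mid          # mid is still feasible: move the lower bound up
        else:
            hi = mid          # mid is infeasible: move the upper bound down
    return lo

g = lambda p: 1 - math.exp(-2 * p)      # toy monotone detection-error proxy
print("max covert transmit power:", bisect_threshold(g, eps=0.1, lo=0.0, hi=5.0))
```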
Authors: Haifeng Wen, Hong Xing, Osvaldo Simeone
For modern artificial intelligence (AI) applications such as large language models (LLMs), the training paradigm has recently shifted to pre-training followed by fine-tuning. Furthermore, owing to dwindling open repositories of data and thanks to efforts to democratize access to AI models, pre-training is expected to increasingly migrate from the current centralized deployments to federated learning (FL) implementations. Meta-learning provides a general framework in which pre-training and fine-tuning can be formalized. Meta-learning-based personalized FL (meta-pFL) moves beyond basic personalization by targeting generalization to new agents and tasks. This paper studies the generalization performance of meta-pFL for a wireless setting in which the agents participating in the pre-training phase, i.e., meta-learning, are connected via a shared wireless channel to the server. Adopting over-the-air computing, we study the trade-off between generalization to new agents and tasks, on the one hand, and convergence, on the other hand. The trade-off arises from the fact that channel impairments may enhance generalization, while degrading convergence. Extensive numerical results validate the theory.
Authors: Aleksandr Berezin, Stephan Balduin, Thomas Oberließen, Sebastian Peter, Eric MSP Veith
This paper addresses the challenge of neural state estimation in power distribution systems. We identified a research gap in the current state of the art, which lies in the inability of models to adapt to changes in the power grid, such as loss of sensors and branch switching, in a zero-shot fashion. Based on the literature, we identified graph neural networks as the most promising class of models for this use case. Our experiments confirm their robustness to some grid changes and also show that a deeper network does not always perform better. We propose data augmentations to improve performance and conduct a comprehensive grid search of different model configurations for common zero-shot learning scenarios.
Authors: Yiming Shu, Jingyuan Zhou, Fu Zhang
Efficiency is critical for autonomous vehicles (AVs), especially for emergency AVs. However, most existing methods focus on regular vehicles, overlooking the distinct strategies required by emergency vehicles to address the challenge of maximizing efficiency while ensuring safety. In this paper, we propose an Integrated Agile Decision-Making with Active and Safety-Critical Motion Planning System (IDEAM). IDEAM focuses on enabling emergency AVs, such as ambulances, to actively attain efficiency in dense traffic scenarios with safety in mind. Firstly, a speed-centric decision-making algorithm, long short-term spatio-temporal graph-centric decision-making (LSGM), is presented. LSGM comprises conditional depth-first search (C-DFS) for generating multiple paths, as well as methods for speed-gain and risk evaluation for path selection, yielding a robust algorithm for high efficiency with safety consideration. Secondly, with an output path from LSGM, the motion planner reconsiders environmental conditions to decide the constraint states for the final planning stage, among which the lane-probing state is designed for actively attaining spatial and speed advantage. Thirdly, under the Frenet-based model predictive control (MPC) framework with the final constraint state and selected path, the safety-critical motion planner employs decoupled discrete control barrier functions (DCBFs) and linearized discrete-time high-order control barrier functions (DHOCBFs) to model the constraints associated with different driving behaviors, making the optimization problem convex. Finally, we extensively validate our system using scenarios from a randomly synthesized dataset, demonstrating its capability to achieve speed benefits and assure safety simultaneously.
Authors: Tiantian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan
Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by challenges in social communication, repetitive behavior, and sensory processing. One important research area in ASD is evaluating children's behavioral changes over time during treatment. The standard protocol with this objective is BOSCC, which involves dyadic interactions between a child and clinicians performing a pre-defined set of activities. A fundamental aspect of understanding children's behavior in these interactions is automatic speech understanding, particularly identifying who speaks and when. Conventional approaches in this area heavily rely on speech samples recorded from a spectator perspective, and there is limited research on egocentric speech modeling. In this study, we design an experiment to perform speech sampling in BOSCC interviews from an egocentric perspective using wearable sensors and explore pre-training Ego4D speech samples to enhance child-adult speaker classification in dyadic interactions. Our findings highlight the potential of egocentric speech collection and pre-training to improve speaker classification accuracy.
Authors: Soroosh Tayebi Arasteh, Mahshad Lotfinia, Paula Andrea Perez-Toro, Tomas Arias-Vergara, Mahtab Ranji, Juan Rafael Orozco-Arroyave, Maria Schuster, Andreas Maier, Seung Hee Yang
Speech pathologies impact communication abilities and quality of life. While deep learning-based models have shown potential in diagnosing these disorders, the use of sensitive data raises critical privacy concerns. Although differential privacy (DP) has been explored in the medical imaging domain, its application in pathological speech analysis remains largely unexplored despite the equally critical privacy concerns. To the best of our knowledge, this study is the first to investigate DP's impact on pathological speech data, focusing on the trade-offs between privacy, diagnostic accuracy, and fairness. Using a large, real-world dataset of 200 hours of recordings from 2,839 German-speaking participants, we observed a maximum accuracy reduction of 3.85% when training with DP with high privacy levels. To highlight real-world privacy risks, we demonstrated the vulnerability of non-private models to gradient inversion attacks, reconstructing identifiable speech samples and showcasing DP's effectiveness in mitigating these risks. To explore the potential generalizability across languages and disorders, we validated our approach on a dataset of Spanish-speaking Parkinson's disease patients, leveraging pretrained models from healthy English-speaking datasets, and demonstrated that careful pretraining on large-scale task-specific datasets can maintain favorable accuracy under DP constraints. A comprehensive fairness analysis revealed minimal gender bias at reasonable privacy levels but underscored the need for addressing age-related disparities. Our results establish that DP can balance privacy and utility in speech disorder detection, while highlighting unique challenges in privacy-fairness trade-offs for speech data. This provides a foundation for refining DP methodologies and improving fairness across diverse patient groups in real-world deployments.
Authors: Jesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi
In this work, we present AfriHuBERT, an extension of mHuBERT-147, a compact self-supervised learning (SSL) model pretrained on 147 languages. While mHuBERT-147 covered 16 African languages, we expand this to 1,226 through continued pretraining on 10K+ hours of speech data from diverse sources, benefiting an African population of over 600M. We evaluate AfriHuBERT on two key speech tasks, Spoken Language Identification (SLID) and Automatic Speech Recognition (ASR), using the FLEURS benchmark. Our results show a +3.6% F1 score improvement for SLID and a -2.1% average Word Error Rate (WER) reduction for ASR over mHuBERT-147, and demonstrate competitiveness with larger SSL models such as MMS and XEUS. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization and are competitive in extremely low-resource ASR scenarios.
Authors: Junlin Han, Jianyuan Wang, Andrea Vedaldi, Philip Torr, Filippos Kokkinos
Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications. Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their ability to capture diverse viewpoints and, even worse, leading to suboptimal generation results if the synthesized views are of poor quality. To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object. Subsequently, a view selection pipeline filters these views based on quality and consistency, ensuring that only the high-quality and reliable views are used for reconstruction. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. FlexRM directly outputs 3D Gaussian points leveraging a tri-plane representation, enabling efficient and detailed 3D generation. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models.
Authors: Guoqiang Wu, Cheng Hu, Wangjia Weng, Zhouheng Li, Yonghao Fu, Lei Xie, Hongye Su
Extreme cornering in racing often leads to large sideslip angles, presenting a significant challenge for vehicle control. Conventional vehicle controllers struggle to manage this scenario, necessitating the use of a drifting controller. However, the large sideslip angle in drift conditions introduces model mismatch, which in turn affects control precision. To address this issue, we propose a model correction drift controller that integrates Model Predictive Control (MPC) with Gaussian Process Regression (GPR). GPR is employed to correct vehicle model mismatches during both drift equilibrium solving and the MPC optimization process. Additionally, the variance from GPR is utilized to actively explore different cornering drifting velocities, aiming to minimize trajectory tracking errors. The proposed algorithm is validated through simulations on the Simulink-Carsim platform and experiments with a 1:10 scale RC vehicle. In the simulation, the average lateral error with GPR is reduced by 52.8% compared to the non-GPR case. Incorporating exploration further decreases this error by 27.1%. The velocity tracking Root Mean Square Error (RMSE) also decreases by 10.6% with exploration. In the RC car experiment, the average lateral error with GPR is 36.7% lower, and exploration further leads to a 29.0% reduction. Moreover, the velocity tracking RMSE decreases by 7.2% with the inclusion of exploration.
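A minimal sketch of the correction idea, assuming a toy two-state system: fit a Gaussian process to the residual between measured and nominal dynamics, then use the posterior mean to correct model predictions and the posterior variance as an exploration signal.
```python
# Hedged sketch of GP-based model correction. The "nominal" and "true" models
# below are invented placeholders, not the paper's vehicle dynamics.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 2))               # states (e.g., sideslip, yaw rate)
nominal = X[:, 0] - 0.5 * X[:, 1]                  # toy nominal-model prediction
true = nominal + 0.3 * np.sin(3 * X[:, 0])         # true dynamics with mismatch
residual = true - nominal + 0.01 * rng.standard_normal(80)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, residual)                                # learn the model mismatch
mu, std = gp.predict(X[:5], return_std=True)       # correction and its uncertainty
print("corrected predictions:", nominal[:5] + mu)
print("exploration signal (posterior std):", std)
```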
Authors: Alejandro Parada-Mayorga, Leopoldo Agorio, Alejandro Ribeiro, Juan Bazerque
In this paper, we develop a generalized theory of convolutional signal processing and neural networks for Reproducing Kernel Hilbert Spaces (RKHS). Leveraging the theory of algebraic signal processing (ASP), we show that any RKHS allows the formal definition of multiple algebraic convolutional models. We show that any RKHS induces algebras whose elements determine convolutional operators acting on RKHS elements. This approach allows us to achieve scalable filtering and learning as a byproduct of the convolutional model, and simultaneously take advantage of the well-known benefits of processing information in an RKHS. To emphasize the generality and usefulness of our approach, we show how algebraic RKHS can be used to define convolutional signal models on groups, graphons, and traditional Euclidean signal spaces. Furthermore, using algebraic RKHS models, we build convolutional networks, formally defining the notion of pointwise nonlinearities and deriving explicit expressions for the training. Such derivations are obtained in terms of the algebraic representation of the RKHS. We present a set of numerical experiments on real data in which wireless coverage is predicted from measurements captured by unmanned aerial vehicles. This particular real-life scenario emphasizes the benefits of the convolutional RKHS models in neural networks compared to fully connected and standard convolutional operators.
Authors: Abulikemu Abuduweili, Chenyang Yuan, Changliu Liu, Frank Permenter
The denoising process of diffusion models can be interpreted as an approximate projection of noisy samples onto the data manifold. Moreover, the noise level in these samples approximates their distance to the underlying manifold. Building on this insight, we propose a novel method to enhance sample generation by aligning the estimated noise level with the true distance of noisy samples to the manifold. Specifically, we introduce a noise level correction network, leveraging a pre-trained denoising network, to refine noise level estimates during the denoising process. Additionally, we extend this approach to various image restoration tasks by integrating task-specific constraints, including inpainting, deblurring, super-resolution, colorization, and compressed sensing. Experimental results demonstrate that our method significantly improves sample quality in both unconstrained and constrained generation scenarios. Notably, the proposed noise level correction framework is compatible with existing denoising schedulers (e.g., DDIM), offering additional performance improvements.
Authors: Zijiang Yang, Meishu Song, Xin Jing, Haojie Zhang, Kun Qian, Bin Hu, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto
The Mice Autism Detection via Ultrasound Vocalization (MADUV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple CNN-based classification using three different spectrogram features. Results demonstrate the feasibility of automated ASD detection, with the considered audible-range features achieving the best performance (UAR of 0.600 for segment-level and 0.625 for subject-level classification). This challenge bridges speech technology and biomedical research, offering opportunities to advance our understanding of ASD models through machine learning approaches. The findings suggest promising directions for vocalization analysis and highlight the potential value of audible and ultrasound vocalizations in ASD detection.
Authors: Ryan Whetten, Lucas Maison, Titouan Parcollet, Marco Dinarelli, Yannick Estève
In Self-Supervised Learning (SSL), pre-training and evaluation are resource-intensive. In the speech domain, current indicators of the quality of SSL models during pre-training, such as the loss, do not correlate well with downstream performance. Consequently, it is often difficult to gauge the final downstream performance in a cost-efficient manner during pre-training. In this work, we propose efficient unsupervised methods that give insights into the quality of the pre-training of SSL speech models, namely, measuring the cluster quality and rank of the embeddings of the SSL model. Results show that measures of cluster quality and rank correlate better with downstream performance than the pre-training loss with only one hour of unlabeled audio, reducing the need for GPU hours and labeled data in SSL model evaluation.
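One of the proposed indicators, embedding rank, can be sketched as the effective rank (exponentiated entropy of the normalized singular-value spectrum) of an embedding matrix; the random matrix here stands in for actual SSL features.
```python
# Effective-rank sketch: a common rank measure for embedding matrices. The
# random input is a placeholder for real SSL frame embeddings.
import numpy as np

def effective_rank(E):
    """E: (num_frames, dim) embedding matrix."""
    s = np.linalg.svd(E - E.mean(0), compute_uv=False)  # singular values
    p = s / s.sum()                                     # normalized spectrum
    entropy = -(p * np.log(p + 1e-12)).sum()            # spectral entropy
    return float(np.exp(entropy))

E = np.random.default_rng(0).standard_normal((1000, 256))  # stand-in embeddings
print("effective rank:", effective_rank(E))
```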
Authors: Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Sarah de Boer, Víctor M. Campello, Aasa Feragen, Enzo Ferrante, Melanie Ganz, Judy Wawira Gichoya, Camila González, Steff Groefsema, Alessa Hering, Adam Hulman, Leo Joskowicz, Dovile Juodelyte, Melih Kandemir, Thijs Kooi, Jorge del Pozo Lérida, Livie Yumeng Li, Andre Pacheco, Tim Rädsch, Mauricio Reyes, Théo Sourget, Bram van Ginneken, David Wen, Nina Weng, Jack Junchi Xu, Hubert Dariusz Zając, Maria A. Zuluaga, Veronika Cheplygina
Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. While existing medical imaging literature reviews mostly focus on machine learning (ML) methods, with only a few focusing on datasets for specific applications, these reviews remain static -- they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after the dataset is published. We refer to these newly discovered findings of datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifact and dataset. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at this http URL.
Authors: Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu
Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. However, these advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual modality inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of Large Audio-Language Models (LALMs) to audio-specific jailbreaks remains largely underexplored. To address this gap, we introduce \textbf{Jailbreak-AudioBench}, which consists of the Toolbox, curated Dataset, and comprehensive Benchmark. The Toolbox supports not only text-to-audio conversion but also various editing techniques for injecting audio hidden semantics. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state-of-the-art LALMs and establish the most comprehensive jailbreak benchmark to date for the audio modality. Finally, Jailbreak-AudioBench establishes a foundation for advancing future research on LALM safety alignment by enabling the in-depth exposure of more powerful jailbreak threats, such as query-based audio editing, and by facilitating the development of effective defense mechanisms.
Authors: Sukkeun Kim, Sangwoo Moon, Ivan Petrunin, Hyo-Sang Shin, Shehryar Khattak
This study proposes a new Gaussian Mixture Filter (GMF) to improve estimation performance for the autonomous robotic radio signal source search and localization problem in unknown environments. The proposed filter is first tested on a benchmark numerical problem to validate its performance against other state-of-the-practice approaches, namely the Particle Filter (PF) and Particle Gaussian Mixture (PGM) filters. The proposed approach is then tested against the PF and PGM filters in robotic field experiments to validate its impact in real-world applications. The considered scenarios involve partial observability, due to range-only measurements, and uncertainty in the measurement model. The results show that the proposed filter handles this partial observability effectively, outperforming the PF while reducing computational requirements and demonstrating improved robustness over the compared techniques.
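For intuition, below is a minimal sketch of a generic Gaussian-mixture update with a range-only measurement, showing both the per-component EKF-style correction and why range-only data can leave symmetric hypotheses unresolved; all parameters are assumptions, and this is not the authors' specific filter.

```python
# Minimal sketch of a Gaussian-mixture filter update with a range-only
# measurement. Illustrates the generic GMF idea, not the paper's filter.
import numpy as np

def gmf_range_update(means, covs, weights, robot_pos, z, r_var):
    """EKF-style update of each component, then likelihood reweighting."""
    new_means, new_covs, new_w = [], [], []
    for m, P, w in zip(means, covs, weights):
        d = m - robot_pos
        rng = np.linalg.norm(d)
        H = (d / rng).reshape(1, -1)          # Jacobian of ||x - p||
        S = H @ P @ H.T + r_var               # innovation covariance (1x1)
        K = P @ H.T / S                       # Kalman gain
        innov = z - rng
        new_means.append(m + (K * innov).ravel())
        new_covs.append(P - K @ H @ P)
        # Reweight by this component's measurement likelihood
        lik = np.exp(-0.5 * innov**2 / S.item()) / np.sqrt(2 * np.pi * S.item())
        new_w.append(w * lik)
    w = np.array(new_w)
    return new_means, new_covs, w / w.sum()

# Two symmetric hypotheses about the source location
means = [np.array([5.0, 0.0]), np.array([-5.0, 0.0])]
covs = [np.eye(2) * 4.0, np.eye(2) * 4.0]
weights = [0.5, 0.5]
means, covs, weights = gmf_range_update(
    means, covs, weights, robot_pos=np.array([0.0, 0.0]), z=5.2, r_var=0.25
)
print(weights)  # stays 0.5/0.5: range alone cannot break the symmetry
```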
Authors: Alberto Padoan, Jeremy Coulson
The paper introduces a class of distances for linear behaviors over finite time horizons. These distances allow for comparisons between finite-horizon linear behaviors represented by matrices of possibly different dimensions. They remain invariant under coordinate changes, rotations, and permutations, ensuring independence from input-output partitions. Moreover, they naturally encode complexity-misfit trade-offs for Linear Time-Invariant (LTI) behaviors, providing a principled solution to a longstanding puzzle in behavioral systems theory. The resulting framework characterizes modeling as a minimum distance problem, identifying the Most Powerful Unfalsified Model (MPUM) as optimal among all systems unfalsified by a given dataset. Finally, we illustrate the value of these metrics in a time series anomaly detection task, where their finer resolution yields superior performance over existing distances.
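The paper's exact metrics are not reproduced here, but a generic Grassmannian-style distance between the column spaces of representing matrices illustrates the advertised invariances: insensitivity to the choice of basis and applicability to matrices of different dimensions. The sketch below is an analogy under these assumptions, not the proposed distance.

```python
# Sketch: comparing two finite-horizon behaviors via principal angles
# between the column spaces of their representing matrices. A generic
# subspace distance shown only to illustrate the invariance properties;
# it is not necessarily the authors' metric.
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
B1 = rng.standard_normal((10, 3))            # behavior 1: 3-dim subspace of R^10
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
B2 = B1 @ Q                                  # same behavior, different basis
B3 = rng.standard_normal((10, 4))            # different behavior, different dimension

d12 = np.linalg.norm(subspace_angles(B1, B2))  # ~0: identical subspaces
d13 = np.linalg.norm(subspace_angles(B1, B3))  # nonzero: genuinely different
print(d12, d13)
```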
Authors: Xinquan Wang, Fenghao Zhu, Chongwen Huang, Zhaohui Yang, Zhaoyang Zhang, Sami Muhaidat, Chau Yuen, Mérouane Debbah
Large language models (LLMs) face significant challenges in specialized domains like telecommunication (Telecom) due to technical complexity, specialized terminology, and rapidly evolving knowledge. Traditional methods, such as scaling model parameters or retraining on domain-specific corpora, are computationally expensive and yield diminishing returns, while existing approaches such as retrieval-augmented generation, mixture of experts, and fine-tuning struggle with accuracy, efficiency, and coordination. To address this issue, we propose the Telecom mixture of models (TeleMoM), a consensus-driven ensemble framework that integrates multiple LLMs for enhanced decision-making in Telecom. TeleMoM employs a two-stage process: proponent models generate justified responses, and an adjudicator finalizes the decision, supported by a quality-checking mechanism. This approach leverages the strengths of diverse models to improve accuracy, reduce biases, and handle domain-specific complexities effectively. Evaluation results demonstrate that TeleMoM achieves a 9.7\% increase in answer accuracy, highlighting its effectiveness in Telecom applications.
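A minimal sketch of a two-stage proponent/adjudicator ensemble in this spirit might look as follows; the prompts, quality check, and control flow are illustrative assumptions rather than the paper's exact pipeline.

```python
# Sketch of a two-stage proponent/adjudicator ensemble (illustrative
# assumptions only; TeleMoM's actual prompts and checks may differ).
from typing import Callable

def telecom_ensemble(
    question: str,
    proponents: list[Callable[[str], str]],   # each wraps one LLM
    adjudicator: Callable[[str], str],
    quality_ok: Callable[[str], bool],
) -> str:
    # Stage 1: each proponent answers with a brief justification.
    candidates = []
    for ask in proponents:
        ans = ask(f"Answer and justify briefly:\n{question}")
        if quality_ok(ans):                   # drop low-quality responses
            candidates.append(ans)
    # Stage 2: the adjudicator consolidates the surviving answers.
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    return adjudicator(
        f"Question: {question}\nCandidate answers:\n{numbered}\n"
        "Pick or synthesize the single best answer."
    )

# Toy stand-ins for real LLM clients:
stub = lambda prompt: "Band n78 (3.5 GHz), because ..."
print(telecom_ensemble("Which 5G band ...?", [stub, stub], stub,
                       lambda a: len(a) > 5))
```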
Authors: Amna Irshad, Emil Björnson, Alva Kosasih, Vitaly Petrov
Movable antennas represent an emerging field in telecommunication research and a potential approach to achieving higher data rates in multiple-input multiple-output (MIMO) communications when the total number of antennas is limited. Most solutions and analyses to date have been limited to \emph{narrowband} setups. This work complements prior studies by quantifying the benefit of using movable antennas in \emph{wideband} MIMO communication systems. First, we derive a novel uplink wideband system model that also accounts for distortion from transceiver hardware impairments. We then formulate and solve an optimization task that maximizes the average sum rate by adjusting the antenna positions using particle swarm optimization. Finally, the performance with movable antennas is compared with that of fixed uniform arrays and the derived theoretical upper bound. The numerical study concludes that the data rate improvement of movable antennas over fixed arrays depends heavily on the level of hardware impairments, the richness of the multi-path environment, and the number of subcarriers. The present study provides vital insights into the most suitable use cases for movable antennas in future wideband systems.
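As a sketch of the optimization step, the following generic particle swarm loop places antennas on a one-dimensional aperture to maximize a stand-in objective; the paper's actual objective is the average sum rate under the wideband model with hardware impairments, which we do not reproduce here.

```python
# Generic particle-swarm sketch for placing N antennas on a line segment.
# The objective is a stand-in (maximize minimum spacing), not the paper's
# wideband sum-rate model.
import numpy as np

rng = np.random.default_rng(1)
N_ANT, N_PART, ITERS = 4, 30, 200
LO, HI = 0.0, 2.0                    # aperture in wavelengths

def objective(pos: np.ndarray) -> float:
    d = np.abs(pos[:, None] - pos[None, :])
    d += np.eye(len(pos)) * 1e9      # ignore self-distances
    return float(d.min())            # reward well-spread antennas

x = rng.uniform(LO, HI, (N_PART, N_ANT))       # particle positions
v = np.zeros_like(x)                           # particle velocities
pbest = x.copy()
pval = np.array([objective(p) for p in x])
g = pbest[pval.argmax()].copy()                # global best

for _ in range(ITERS):
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
    x = np.clip(x + v, LO, HI)
    val = np.array([objective(p) for p in x])
    better = val > pval
    pbest[better], pval[better] = x[better], val[better]
    g = pbest[pval.argmax()].copy()

print(np.sort(g))   # best antenna positions found
```

Swapping `objective` for an average-sum-rate evaluation over subcarriers would recover the structure of the paper's optimization task.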
Authors: Nikolaos Louloudakis, Ajitha Rajan
With over 700 stars on GitHub and being part of the official ONNX repository, the ONNX Optimizer is the default tool for applying graph-based optimizations to ONNX models. Despite its widespread use, its ability to preserve model accuracy during optimization has not been thoroughly investigated. In this work, we present OODTE, a utility designed to automatically and comprehensively evaluate the correctness of the ONNX Optimizer. OODTE adopts a straightforward yet powerful differential testing and evaluation methodology that can be readily adapted for use with other compiler optimizers. Specifically, OODTE takes a collection of ONNX models, applies optimizations, and executes both the original and optimized versions across a user-defined input set, automatically capturing any issues encountered during optimization. When discrepancies in accuracy arise, OODTE iteratively isolates the responsible optimization pass by repeating the process at a finer granularity. We applied OODTE to 130 well-known models from the official ONNX Model Hub, spanning diverse tasks including classification, object detection, semantic segmentation, text summarization, question answering, and sentiment analysis. Our evaluation revealed that 9.2% of the model instances either caused the optimizer to crash or led to the generation of invalid models under the default optimization strategies. Additionally, 30% of classification models and 16.6% of object detection and segmentation models exhibited differing outputs across original and optimized versions, whereas models focused on text-related tasks were generally robust to optimization. OODTE uncovered 15 issues (14 previously unknown) affecting 9 of 47 optimization passes and the optimizer overall. All issues were reported to the ONNX Optimizer team. OODTE offers a simple but effective framework for validating AI model optimizers, applicable beyond the ONNX ecosystem.
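A condensed sketch of this differential-testing loop, using the public onnx, onnxoptimizer, and onnxruntime packages, might look as follows; the tolerance and the single random input are our simplifying assumptions, and OODTE's actual harness is more thorough.

```python
# Condensed sketch of OODTE-style differential testing of the ONNX
# Optimizer: optimize a model, run both copies on the same input, and
# compare outputs. Tolerance and input generation are assumptions.
import numpy as np
import onnx
import onnxoptimizer
import onnxruntime as ort

def differential_test(model_path: str, atol: float = 1e-4) -> bool:
    model = onnx.load(model_path)
    optimized = onnxoptimizer.optimize(model)   # default pass set

    sess_a = ort.InferenceSession(model.SerializeToString())
    sess_b = ort.InferenceSession(optimized.SerializeToString())

    # Build a random input matching the model's first input signature,
    # substituting 1 for any symbolic (dynamic) dimensions.
    inp = sess_a.get_inputs()[0]
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feed = {inp.name: np.random.rand(*shape).astype(np.float32)}

    out_a = sess_a.run(None, feed)
    out_b = sess_b.run(None, feed)
    return all(np.allclose(a, b, atol=atol) for a, b in zip(out_a, out_b))

# On a mismatch, one would re-run with individual passes (see
# onnxoptimizer.get_available_passes()) to isolate the culprit,
# mirroring OODTE's finer-granularity isolation step.
```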