Jonghyun Yun is a data scientist at the Institute of Statistical Data Science. He completed his PhD in the Department of Statistics at the University of Illinois at Urbana-Champaign under the guidance of Dr. Yuguo Chen. He was an Assistant Professor in the Department of Mathematics at UT Arlington (UTA). Prior to joining UTA, he was an Assistant Professor in the Department of Mathematical Sciences and a core faculty member in the Border Biomedical Research Center at the University of Texas at El Paso from 2015 to 2016. From 2012 to 2015, he was a postdoctoral researcher at UT Southwestern Medical Center under the guidance of Drs. Guanghua Xiao and Yang Xie.
Ph.D. in Statistics, 2012
University of Illinois at Urbana–Champaign
MA in Applied Statistics, 2006
BAs in Applied Statistics and Business Administration, 2004
This study analyzes cost overrun occurrence (COO) in the context of socioeconomic conditions, leveraging machine learning techniques and geographic information systems, because little is known about the relationship between socioeconomic factors and cost overruns in transportation infrastructure improvement projects. We extract socio-geospatial features from multiple data sources and build a random forest model to discover their associations with COO. The developed models reveal highly significant features affecting COO, including original amounts, original duration, management districts, number of lanes, population over 16 years old, commuting behavior, industrial topography, and average temperature, indicating that socioeconomic conditions play an important role in actual project expenses. Our findings will help practitioners and decision-makers better forecast and account for the likely impacts of the socioeconomic conditions surrounding a project in their planning, budgeting, and operation and maintenance. The software for the statistical analysis is available at github.com/jonghyun-yun/dico.
This study presents a random shape aggregate model by establishing a functional mixture model for images of aggregate shapes. Mesoscale simulation that accounts for the heterogeneous properties of concrete is a highly cost- and time-effective method for predicting the mechanical behavior of concrete. Because the design of the mesoscale concrete model is so significant, the shape of the aggregate is the most important parameter for obtaining a reliable simulation result. We propose image analysis and functional data clustering for random shape aggregate models (IFAM). This novel technique learns the morphological characteristics of aggregates using images of real aggregates as inputs. As outputs, IFAM provides random aggregates across a broad range of heterogeneous shapes using samples drawn from the estimated functional mixture model. Our learning algorithm is fully automated and allows flexible learning of complex characteristics. Therefore, unlike similar studies, IFAM does not require users to perform time-consuming tuning of their model to produce realistic aggregate morphology. Using comparative studies, we demonstrate that the random aggregate structures constructed by IFAM closely resemble real aggregates in an inhomogeneous concrete medium. Because our method is fully data-driven, users can choose their own libraries of real aggregates for training the model and generate random aggregates that closely resemble the target libraries.
Recently, Hamiltonian Monte Carlo (HMC) has become widespread as one of the more reliable approaches to efficient sample generation. However, HMC has difficulty sampling from a multimodal posterior distribution because the HMC chain cannot cross the energy barriers between modes due to the energy conservation property. In this paper, we propose a Stochastic Approximate Hamilton Monte Carlo (SAHMC) algorithm for generating samples from multimodal densities under the HMC framework. SAHMC can adaptively lower the energy barrier so that the Hamiltonian trajectory moves more frequently and more easily between modes. Our simulation studies show that SAHMC explores a multimodal target distribution more efficiently than HMC-based implementations.
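The energy-barrier problem described in this abstract can be illustrated with a minimal sketch of plain HMC (not SAHMC) on a one-dimensional bimodal target. Everything here is an assumption made for illustration and is not taken from the paper: the target (an equal mixture of two unit-variance Gaussians centered at -8 and +8), the step size, the number of leapfrog steps, and all function names.

```python
import math
import random

MODE = 8.0  # assumed for illustration: modes at -MODE and +MODE, unit variance


def _log_components(x):
    """Unnormalized log densities of the two mixture components."""
    return -0.5 * (x + MODE) ** 2, -0.5 * (x - MODE) ** 2


def potential(x):
    """Potential energy U(x) = -log p(x), computed with log-sum-exp."""
    la, lb = _log_components(x)
    mx = max(la, lb)
    return -(mx + math.log(math.exp(la - mx) + math.exp(lb - mx)))


def grad_potential(x):
    """Gradient of U: a responsibility-weighted pull toward each mode."""
    la, lb = _log_components(x)
    mx = max(la, lb)
    a, b = math.exp(la - mx), math.exp(lb - mx)
    return (a * (x + MODE) + b * (x - MODE)) / (a + b)


def hmc(n_samples, x0, step=0.2, n_leapfrog=20, seed=0):
    """Plain HMC with a leapfrog integrator and a Metropolis correction."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)          # resample the momentum
        x_new, p_new = x, p
        p_new -= 0.5 * step * grad_potential(x_new)   # initial half step
        for i in range(n_leapfrog):
            x_new += step * p_new                     # full position step
            if i < n_leapfrog - 1:
                p_new -= step * grad_potential(x_new)  # full momentum step
        p_new -= 0.5 * step * grad_potential(x_new)   # final half step
        h_old = potential(x) + 0.5 * p * p
        h_new = potential(x_new) + 0.5 * p_new * p_new
        # Accept with probability min(1, exp(h_old - h_new)).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples


if __name__ == "__main__":
    draws = hmc(2000, x0=-8.0)
    share_right = sum(d > 0 for d in draws) / len(draws)
    print("share of draws in the right-hand mode:", share_right)
```

In this setup the barrier at x = 0 is roughly 31 energy units above the bottom of either mode, so a chain started at -8 is expected to stay in the left-hand mode even though half of the target mass sits on the right. This is the behavior the abstract attributes to energy conservation.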
Particle filters have been widely used for online filtering problems in state–space models (SSMs). Currently available proposal distributions depend either only on the state dynamics, or only on the observation, or on both sources of information but are not available for general SSMs. In this article, we develop a new particle filtering algorithm, called the augmented particle filter (APF), for online filtering problems in SSMs. The APF combines two sets of particles, one from the observation equation and one from the state equation, and the state space is augmented to facilitate the weight computation. Theoretical justification of the APF is provided, and the connection between the APF and the optimal particle filter (OPF) in some special SSMs is investigated. The APF shares similar properties with the OPF but can be applied to a much wider range of models. Simulation studies show that the APF performs similarly to or better than the OPF when the OPF is available, and that the APF can perform better than other filtering algorithms in the literature when the OPF is not available.
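For context, the simplest proposal mentioned in this abstract, one that depends only on the state dynamics, corresponds to the standard bootstrap particle filter. The sketch below is not the APF. It assumes a hypothetical linear Gaussian model, x_t = phi * x_{t-1} + N(0, 1) and y_t = x_t + N(0, 1), and the function names and parameter values are chosen purely for illustration.

```python
import math
import random


def simulate(T, phi, rng):
    """Simulate states and observations from the assumed linear Gaussian SSM."""
    xs, ys = [], []
    x = rng.gauss(0.0, 1.0)
    for _ in range(T):
        x = phi * x + rng.gauss(0.0, 1.0)    # state equation
        xs.append(x)
        ys.append(x + rng.gauss(0.0, 1.0))   # observation equation
    return xs, ys


def bootstrap_filter(ys, phi, n_particles, rng):
    """Return the filtered mean E[x_t | y_1..y_t] at each time step."""
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    means = []
    for y in ys:
        # Proposal: propagate each particle through the state equation only.
        particles = [phi * x + rng.gauss(0.0, 1.0) for x in particles]
        # Weights: observation likelihood, stabilized by subtracting the max.
        logw = [-0.5 * (y - x) ** 2 for x in particles]
        mx = max(logw)
        w = [math.exp(l - mx) for l in logw]
        total = sum(w)
        means.append(sum(wi * x for wi, x in zip(w, particles)) / total)
        # Multinomial resampling proportional to the weights.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return means


if __name__ == "__main__":
    rng = random.Random(0)
    xs, ys = simulate(100, 0.9, rng)
    est = bootstrap_filter(ys, 0.9, 500, rng)
    rmse = math.sqrt(sum((e - x) ** 2 for e, x in zip(est, xs)) / len(xs))
    print("filter RMSE against the true states:", rmse)
```

Because the proposal ignores the current observation, many particles can land in regions with low likelihood when the observation is informative. That weakness is the usual motivation for proposals that also use the observation, such as the OPF and the APF discussed above.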
Photoactivatable ribonucleoside enhanced cross-linking immunoprecipitation (PAR-CLIP) has been increasingly used for the global mapping of RNA-protein interaction sites. PAR-CLIP experiments have two key features: the sequence read tags are likely to form an enriched peak around each RNA-protein interaction site, and the cross-linking procedure is likely to introduce a specific mutation in each sequence read tag at the interaction site. Several ad hoc methods have been developed to identify RNA-protein interaction sites using either sequence read counts or mutation counts alone; however, rigorous statistical methods for analyzing PAR-CLIP data are still lacking. In this study, we propose an integrative model that establishes a joint distribution of the observed read and mutation counts. To pinpoint the interaction sites at single base-pair resolution, we developed a novel modeling approach that adopts non-homogeneous hidden Markov models to incorporate the nucleotide sequence at each genomic location. Both simulation studies and a data application show that our method outperforms the ad hoc methods and provides reliable inferences for RNA-protein binding sites from PAR-CLIP data.