When annotators disagree, that disagreement can reflect epistemic uncertainty rather than simple label noise. We study hard-label delivery as an alternative to the usual choices of collapsing votes to a single label or training directly on the empirical soft-label distribution. We focus on two primary hard-label methods: multipass, which cycles through observed votes while keeping the dataset size fixed, and stochastic label sampling (SLS), which samples one label per example at the start of each epoch. On CIFAR-10H, we find that when only a small number of annotations per example is available, hard-label delivery improves over soft-label training, with larger improvements where the sparse empirical target is farther from the full annotator distribution. When full annotator distributions are available, both hard-label methods match soft-label training. We use deterministic control as an ablation of multipass and shuffled SLS as a control that breaks the example-to-distribution match. We also show that SLS and soft-label cross-entropy optimize the same expected objective. Hard-label delivery also converges to flatter basins, with supporting descriptive evidence from OOD detection on SVHN and CIFAR-100. Overall, these results suggest that multipass is a strong practical default when raw vote counts are available, while SLS offers a lightweight alternative that remains competitive when only a few votes per example are available and matches soft-label training when full annotator distributions are available.
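The claim that SLS and soft-label cross-entropy share the same expected objective can be checked numerically. The sketch below is illustrative only and is not the authors' code: the toy votes, the fixed model outputs, and all function names are hypothetical, and the multipass rule shown (epoch e uses each example's vote at index e mod the number of votes) is one assumed reading of "cycles through observed votes".

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax for one example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def empirical_distribution(votes, num_classes):
    """Soft label: fraction of annotator votes per class."""
    return [votes.count(k) / len(votes) for k in range(num_classes)]

def soft_ce(log_q, targets):
    """Mean cross-entropy against soft targets."""
    per_example = [-sum(p * lq for p, lq in zip(t, row))
                   for t, row in zip(targets, log_q)]
    return sum(per_example) / len(per_example)

def hard_ce(log_q, labels):
    """Mean cross-entropy against one hard label per example."""
    return sum(-row[y] for row, y in zip(log_q, labels)) / len(labels)

def sls_labels(targets, rng):
    """SLS: draw one label per example from its distribution (once per epoch)."""
    return [rng.choices(range(len(t)), weights=t)[0] for t in targets]

def multipass_labels(votes, epoch):
    """Multipass (assumed form): epoch e uses each example's (e mod n_votes)-th vote."""
    return [v[epoch % len(v)] for v in votes]

# Hypothetical toy data: three examples, three classes, raw annotator votes.
votes = [[0, 0, 1], [2, 2, 2], [1, 0, 1, 1]]
targets = [empirical_distribution(v, 3) for v in votes]
# Fixed, made-up model outputs so that only the label delivery varies.
log_q = [log_softmax(z) for z in ([2.0, 0.5, -1.0], [0.1, 0.2, 1.5], [-0.3, 1.0, 0.0])]

rng = random.Random(0)
soft = soft_ce(log_q, targets)
# Average the hard-label loss over many simulated SLS epochs.
sls_mean = sum(hard_ce(log_q, sls_labels(targets, rng)) for _ in range(20000)) / 20000
# 12 epochs is one full cycle here (least common multiple of the vote counts 3, 3 and 4).
multipass_mean = sum(hard_ce(log_q, multipass_labels(votes, e)) for e in range(12)) / 12

print(soft, sls_mean, multipass_mean)
```

For a fixed model, the SLS average approaches the soft-label loss as the number of simulated epochs grows, and the multipass average over a full cycle matches it up to floating-point error. The sketch says nothing about optimization dynamics, which is where the abstract reports the methods differing.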
@misc{gheibi2026targetdifferentbasinshard,title={Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions},author={Gheibi, Mirerfan and Ghazizadeh, Gashin},year={2026},eprint={2605.20642},archiveprefix={arXiv},primaryclass={cs.LG},}
We develop a computer vision system to help biologists detect endangered whales. Given access to a limited dataset of aerial imagery (1544 images of mainly water), we implemented object detection and semantic segmentation models. For segmentation, we leverage the extreme data imbalance by introducing an elliptic annotation mechanism that mitigates the need for tight annotations while respecting the limited time of expert annotators. Data scarcity made a zero false-negative rate infeasible, so we minimized false negatives while keeping false positives few enough that the system could still help an expert annotator accelerate the annotation process itself. This would allow a bootstrapping approach to dataset creation: collecting increasingly larger datasets in parallel with training increasingly accurate models. We evaluated performance on the downstream bootstrapping task with an AI-in-the-Loop experiment. Motivated by the expert user’s workflow, this required developing a feature-based clustering visualization of the images. Our segmentation system admitted few false negatives and was more efficient than manual data collection alone. While the proposed approach cannot entirely solve the challenge of the extremely small dataset, it suggests that a slightly larger dataset (e.g., adding 100 whale images would double the relevant training set) may be sufficient to bootstrap the training and collection with effectively no false negatives.
@mastersthesis{gheibi2021AIITL,title={Helping Biologists Find Whales: AI-in-the-Loop Support for Environmental Dataset Creation},author={Gheibi, Mirerfan},year={2021},school={Dalhousie University}}
2020
CAIAC
CB-DBSCAN: A Novel Clustering Algorithm for Adjacent Clusters with Different Densities.
Density-based clustering is well known for finding clusters of different shapes and sizes, but it gives unsatisfactory results on adjacent clusters with different densities. In this paper, we propose a novel algorithm that combines DBSCAN with centroid-based algorithms to address this issue. Our algorithm uses DBSCAN to form mini-clusters, which are then merged based on their densities and the distances between their centers. We test the new algorithm on synthetic and real datasets to show the significant improvement in the results.
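The two-stage idea in this abstract (DBSCAN mini-clusters, then merging by density and center distance) can be sketched as follows. This is not the published CB-DBSCAN algorithm: the density proxy (inverse mean nearest-neighbour distance), the single-pass merge rule, the thresholds `max_center_dist` and `min_density_ratio`, and the toy data are all assumptions made for illustration.

```python
import math
from itertools import combinations

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 for noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1                      # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster             # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:    # only core points expand
                frontier.extend(neighbors[j])
    return labels

def summarize(points):
    """Centroid and a density proxy: inverse mean nearest-neighbour distance."""
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    nn = [min(math.dist(p, q) for j, q in enumerate(points) if j != i)
          for i, p in enumerate(points)]
    return centroid, len(nn) / max(sum(nn), 1e-12)

def merge_mini_clusters(points, labels, max_center_dist, min_density_ratio):
    """Merge mini-clusters whose centers are close and whose densities are similar."""
    ids = sorted(set(labels) - {-1})
    stats = {c: summarize([p for p, l in zip(points, labels) if l == c]) for c in ids}
    parent = {c: c for c in ids}

    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c

    # Single pass over pairs, using the statistics of the original mini-clusters.
    for a, b in combinations(ids, 2):
        (ca, da), (cb, db) = stats[a], stats[b]
        close = math.dist(ca, cb) <= max_center_dist
        similar = min(da, db) / max(da, db) >= min_density_ratio
        if close and similar:
            parent[find(b)] = find(a)
    return [-1 if l == -1 else find(l) for l in labels]

def grid_blob(cx, cy, spacing):
    """Hypothetical 3x3 grid of points centered at (cx, cy)."""
    return [(cx + i * spacing, cy + j * spacing) for i in (-1, 0, 1) for j in (-1, 0, 1)]

# Two dense blobs next to two sparser blobs.
points = (grid_blob(0.0, 0.0, 0.05) + grid_blob(0.6, 0.0, 0.05)
          + grid_blob(1.5, 0.0, 0.2) + grid_blob(2.3, 0.0, 0.2))
mini = dbscan(points, eps=0.25, min_pts=3)
merged = merge_mini_clusters(points, mini, max_center_dist=1.0, min_density_ratio=0.5)
```

In this toy setup DBSCAN yields four mini-clusters. The merge step joins the two dense blobs and the two sparse blobs, and keeps the dense and sparse groups apart even though their nearest centers fall within the distance threshold, because their density proxies differ too much.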
@inproceedings{ghazizadeh2020cb,title={CB-DBSCAN: A Novel Clustering Algorithm for Adjacent Clusters with Different Densities.},author={Ghazizadeh, Gashin and Gheibi, Mirerfan and Matwin, Stan},booktitle={Canadian Conference on AI},pages={232--237},year={2020},}
2019
SUT
GPU-based Acceleration of Isogeny-based Cryptography
Post-quantum cryptography, one of the newest families of cryptographic algorithms, is thought to be secure against even the most sophisticated attacks by quantum computers. Isogeny-based cryptography is an appealing contender within it because of its exceptional characteristics, especially having the shortest public keys for key encapsulation, encryption, and decryption among the candidates for the NIST post-quantum standard. However, its high computational complexity is a significant drawback. This research aims to improve the performance of the most compute-intensive part of isogeny-based cryptography, in both throughput and latency, on GPUs and CPUs, which are the most widespread off-the-shelf processors. A considerable part of the computation in isogeny-based cryptography is high-degree isogeny computation. This thesis presents several high-performance implementations of a parallel approach to isogeny evaluation. The GPU implementation is the first of its kind and reaches up to 44 improvement in terms of throughput in comparison to the fastest software implementation of isogeny-based cryptography.
@mastersthesis{gheibi2019GPUaccelCrypto,title={GPU-based Acceleration of Isogeny-based Cryptography},author={Gheibi, Mirerfan},year={2019},school={Sharif University of Technology}}
2017
IEEE_IR
Proposing a tokenizer for Farsi words, by using regular expressions (The paper was originally written in Persian)
This abstract is translated from the original abstract of the paper, written in Persian: This paper presents a novel word tokenizer that utilizes regular expressions to split words in a given text. The tokenizer is built upon the concept of replaceability in regular expressions. The proposed method is capable of accurately recognizing and processing various elements such as Farsi and English words, symbols, and other unique expressions. The algorithm aims to effectively identify and isolate words while accounting for their respective frequencies. Consequently, the output of the system includes the processed text, a word count with repetition (Words), a distinct vocabulary count (Vocabulary), and a list that presents each word alongside its frequency of occurrence. This list is sorted alphabetically and by frequency to provide an efficient summary of the processed text. The tokenization process is crucial for natural language processing applications, and this novel tokenizer offers an effective and adaptable solution.
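A minimal sketch of a substitution-based regular-expression tokenizer with the outputs listed above (processed tokens, Words, Vocabulary, and a frequency list) might look like the following. It is a loose illustration, not the paper's method: the symbol pattern, the choice to keep the zero-width non-joiner (U+200C) inside words, the ordering of the frequency list (by descending frequency, then by code point), and the function names are all assumptions.

```python
import re
from collections import Counter

# Any character that is not a word character, whitespace, or the zero-width
# non-joiner (U+200C, common inside Farsi words) is treated as a symbol.
SYMBOL = re.compile(r"([^\w\s\u200c])")

def tokenize(text):
    """Pad every symbol with spaces via substitution, then split on whitespace."""
    return SYMBOL.sub(r" \1 ", text).split()

def summarize(text):
    """Return tokens, word count, vocabulary size, and a sorted frequency list."""
    tokens = tokenize(text)
    # A token counts as a word if it contains at least one word character.
    words = [t for t in tokens if re.search(r"\w", t)]
    counts = Counter(words)
    # Most frequent first; ties are broken by code-point order.
    frequencies = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "tokens": tokens,
        "Words": len(words),
        "Vocabulary": len(counts),
        "frequencies": frequencies,
    }

# Mixed Farsi and English sample with punctuation.
result = summarize("سلام دنیا! Hello, world. سلام")
print(result["Words"], result["Vocabulary"], result["frequencies"])
```

Python's `\w` is Unicode-aware for `str` patterns, so Farsi and English letters are both treated as word characters without listing code-point ranges explicitly. A production tokenizer would likely need explicit handling for diacritics, numerals, and multi-character expressions.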