I am a recent MA graduate in Applied Linguistics from the University of Saskatchewan. My main research interests are natural language processing (NLP), corpus linguistics, and computational social science. I combine computational and linguistic approaches to study the properties of languages, along with the linguistic and social meanings that can be inferred from actual language use in large-scale corpora. I am also highly interested in how neural networks encode linguistic knowledge and "learn" human languages.
Inspired by Andrew Ng, I believe in the promise of data-centric AI (as opposed to model-centric AI). In my research, I aim to leverage both linguistic domain knowledge and learning algorithms to train more robust NLP models with data that is either much smaller in size or can be obtained with much less manual effort. Only in this way can we extend computational linguistics to understudied text and language domains and bring more practical value to the real world.
I was born and raised in Fuqing, a small city in southeastern China. From 2015 to 2019, I studied at Hunan University for my bachelor's degree. Majoring in Chinese Language and Literature, I discovered my passion for linguistics in my first year of university and went on to undertake a two-year funded project on Chinese grammar and a three-month psycholinguistics internship in Canada. Besides literature- and linguistics-related research, I also spent time doing research in law, history, social science, and a bit of philosophy.
From 2019 to 2021, I studied at the University of Saskatchewan, majoring in Applied Linguistics. I taught myself programming and computational linguistics and focused on quantitative, data-driven analysis of language use in transcribed linguistic corpora. Starting from purely rule-based programming, exemplified by extensive regular expressions for syntactic parsing, text extraction, and corpus annotation, I steadily became fluent in building NLP models with statistical machine learning and deep learning. As I am new to this area, I welcome any like-minded people to reach out!
My full CV.
Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chinese Question Matching
Zhengxiang Wang, arXiv:2111.14709v1 [cs.CL], 2021
About: As an effort toward data-centric NLP, I explored the role of probabilistic linguistic knowledge in data augmentation for a binary Chinese question matching classification task. You can check out the source code, data, experimental results, and recent updates in this repository.
A macroscopic re-examination of language and gender: a corpus-based case study in the university classroom setting
MA thesis, University of Saskatchewan, 2021
About: The thesis compared the use of 87 syntactic and lexical linguistic features by male and female university instructors across four academic disciplines in a data-driven manner, taking the complexity of language use into account. Linguistic Feature Extractor automates the feature extraction.
Grammar as science: Rethinking the construction of Modern Chinese Grammar
A funded research project, Hunan University, 2017-2019
About: 11 essays/papers (in Chinese, 99 pages), with one published, containing both synchronic and diachronic studies of several Chinese grammatical phenomena. Topics include parts-of-speech classification, lexical reduplication, grammaticalization, negators, and discourse particles.
- Text matching explained & Text classification explained: building and training deep learning models for text (matching) classification tasks from scratch using paddle, PyTorch, and TensorFlow.
- Notes for Stanford CS224N: Natural Language Processing with Deep Learning.
- Hands-on gradient-derivation tutorials for common machine learning loss functions.
- Deep-learning-based Natural Language Processing using paddlenlp: covering a wide range of essential NLP tasks (both classification and non-classification) for industry, along with state-of-the-art (SOTA) practices.
- Word embedding resources, applications, visualization, and training (word2vec in Python).
- Text augmentation techniques: from random text-editing perturbations and back translation to model-based transformations. Also see: data augmentation programs (plus an n-gram language model).
- Historical English Language Processing Toolkit: an efficient toolkit and general framework for Early Modern and Modern English language processing (multi-label annotation) in XML.
- Linguistic Feature Extractor: a corpus-linguistic tool to extract and search for linguistic features (with 95 built-in features), generating both feature statistics and the extracted instances.
- Unfilled Pause Classifier: a rule-based syntactic parser classifying unfilled pauses in the British Academic Spoken English (BASE) corpus.
- Google Scholar Analyzer: automatically aggregating the academic profiles of researchers on Google Scholar.
- YouTube Info Collector: an interface to scrape information (video titles, post dates, view counts, like counts, comments, etc.) from YouTube videos based on queries, video links, or channel links.
- Gender Predictor: predicting the gender of Chinese names with over 93% (up to 97%) test-set accuracy using Naive Bayes and multi-class logistic regression models (tutorials included).
- CCNC: a Comprehensive Chinese Name Corpus (3.65M unique name samples).
- Chinese Ngrams Counts: character-based and word-based n-gram counts from large-scale corpora.
- Corpus of Chinese synonyms: compiled from multiple reputable sources, with over 70k base examples.
- Corpus of Chinese fixed phrases and idioms: rich dictionary-like entries for 30,310 instances.
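The random text-editing perturbations mentioned in the text augmentation entry above can be illustrated with a minimal, EDA-style (Easy Data Augmentation) sketch. The function names and the sample sentence here are illustrative only and are not taken from the actual repositories:

```python
import random

def random_swap(tokens, n=1):
    """Randomly swap two tokens n times (an EDA-style perturbation)."""
    tokens = tokens[:]  # copy so the input list is left untouched
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Delete each token independently with probability p; keep at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

if __name__ == "__main__":
    random.seed(42)
    sent = "the quick brown fox jumps over the lazy dog".split()
    print(" ".join(random_swap(sent, n=2)))
    print(" ".join(random_deletion(sent, p=0.2)))
```

Each perturbed sentence keeps (most of) the original label-relevant content while varying surface form, which is why such cheap perturbations can act as label-preserving training data for classification tasks.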