
I am a first-year PhD student in Linguistics at Stony Brook University. My research interests are in Natural Language Processing, Computational Linguistics, Formal Language Theory, Machine Learning, and Computational Social Sciences.
I work with linguistic corpora, both real and synthetic. My current research focuses on understanding how, and how well, neural networks learn and generalize in light of formal language theory and computational learning theory. I appreciate the insights from the traditional symbolic learning literature and believe there is tremendous benefit in using these insights to examine and explain the capabilities of neural networks, particularly the prevailing large language models (or foundation models).
Inspired by data-centric AI as proposed by Andrew Ng, I am also interested in finding better and more efficient ways of building robust NLP models with small data. My past research was highly applied, aiming to understand the social meanings of actual language use in both spoken and written linguistic data.
I am currently looking for an ML/NLP or related internship for summer 2023 in the US or Canada.
Bio
I was born and raised in Fuqing, a small town in southeastern China. Prior to coming to Stony Brook, I completed a bachelor's degree in Chinese Language and Literature at Hunan University and a master's degree in Applied Linguistics at the University of Saskatchewan.
I am a proud self-taught and self-motivated programmer. I started learning programming in 2020 and have managed to make it relevant to, and then part of, my daily life. Looking back, I am glad to find that my experience with NLP aligns well with the three major phases of the field: rule-based (symbolic) methods, statistical machine learning, and deep learning.
Feel free to reach out if you are interested in talking with me!
CV
Here is my CV.
Research
- Learning Transductions and Alignments with RNN Seq2seq models
Zhengxiang Wang, 2023
About: I designed and conducted comprehensive experiments to examine the capabilities of Recurrent-Neural-Network sequence-to-sequence (RNN seq2seq) models in learning four transduction tasks of varying complexity that can be described as learning alignments. I studied the models' generalization abilities, the role of attention, the effect of RNN variants, and task complexity.
- Developing literature review writing skills through an online writing tutorial series: Corpus-based evidence
Zhi Li, Makarova Veronika, Zhengxiang Wang (*equal contribution), Frontiers in Communication-Language Sciences, 2023
About: By analyzing a cluster of linguistic features extracted from 29 L2 graduate students' writing samples over 3 months, we tracked evidence of development in genre awareness and mastery of academic writing. The results indicated the non-linear and dynamic nature of L2 learning.
- Random Text Perturbations Work, but not Always
Zhengxiang Wang, AACL-IJCNLP 2022 Workshop Eval4NLP
About: As a continuation of my research on text augmentation, I examined the effectiveness and generalizability of random text perturbations in text pair classification tasks for both Chinese and English. The results revealed the complex nature of text augmentation and its evaluation.
- Thirty-Two Years of IEEE VIS: Authors, Fields of Study and Citations
Hongtao Hao, Yumian Cui, Zhengxiang Wang, Yea-Seul Kim, IEEE Transactions on Visualization and Computer Graphics, 2022
About: IEEE VIS is the top-tier conference in the field of Visualization. The study marks the first effort to comprehensively examine and visualize the authors and fields of study of 3,240 VIS publications over the past 32 years. Temporal trends are also extensively investigated.
- Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chinese Question Matching
Zhengxiang Wang, ICNLSP 2022
About: As an effort toward data-centric NLP, I explored the role of probabilistic linguistic knowledge in data augmentation for a binary Chinese question matching classification task. You can check out the source code, data, experimental results, and recent updates in this repository.
- A macroscopic re-examination of language and gender: a corpus-based case study in the university classroom setting
MA thesis, University of Saskatchewan [Slides], 2021
About: The thesis compared, in a data-driven manner, the use of 87 syntactic and lexical linguistic features by male and female university instructors across four academic disciplines, taking the complexity of language use into account. The Linguistic Feature Extractor tool automates the feature extraction.
Resources
Deep Learning
- RNN Seq2seq transduction : customized pipelines to model language transduction tasks using RNN seq2seq models
- RNN transduction : customized pipelines to model language transduction tasks using RNNs
- Text matching explained & Text classification explained : building and training deep learning models for text (matching) classification tasks from scratch using paddle, PyTorch, and TensorFlow.
- Notes for Stanford CS224N : Natural Language Processing with Deep Learning.
- Hands-on gradient derivation tutorials for common machine learning loss functions.
- Deep-learning-based Natural Language Processing using paddlenlp : covering a wide range of essential NLP tasks (both classification and non-classification) for industry, along with SOTA practices.
- Word embedding resources, applications, visualization, and training (word2vec in Python).
Text Processing
- Text augmentation techniques : from random text-editing perturbations and back translation to model-based transformations. Also see: data augmentation programs (plus an n-gram language model).
- Historical English Language Processing Toolkit : An efficient toolkit and a general framework for early modern & modern English Language Processing (multi-label annotation) in XML.
- Linguistic Feature Extractor : A corpus-linguistic tool to extract and search for linguistic features (with 95 built-in features), which generates both feature statistics and the extracted instances.
- Unfilled Pause Classifier : a rule-based syntactic parser that classifies unfilled pauses, based on the British Academic Spoken English corpus.
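To give a rough idea of what the random text-editing perturbations in the text augmentation entry above look like, here is a minimal sketch in Python. It is not the code from the linked repositories: the function names `random_swap` and `random_delete`, the whitespace tokenization, and the default parameters are all assumptions made for illustration.

```python
import random


def random_swap(tokens, n_swaps=1, rng=None):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    rng = rng or random.Random()
    out = list(tokens)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p, keeping at least one."""
    rng = rng or random.Random()
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Avoid returning an empty text: fall back to a single random token.
    return kept if kept else [rng.choice(list(tokens))]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the perturbations are reproducible
    sentence = "how do I reset my account password".split()
    print(" ".join(random_swap(sentence, n_swaps=2, rng=rng)))
    print(" ".join(random_delete(sentence, p=0.2, rng=rng)))
```

For Chinese text, the same functions could be applied to a list of characters or segmented words instead of whitespace-split tokens; which unit works better is an empirical question.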
Web Scraping
- Google Scholar Analyzer : Auto-aggregating academic profiles of researchers on Google Scholar.
- YouTube Info Collector : An interface to scrape information (video titles, post dates, view counts, like counts, comments, etc.) from YouTube videos based on queries, video links, or channel links.
Chinese-related
- Gender predictor : Predicting the gender of given Chinese names with over 93% (up to 99%) test set accuracy using Naive Bayes, multi-class logistic regression, and neural network models.
- CCNC : A Comprehensive Chinese Name Corpus (3.65M unique name samples).
- Chinese Ngrams Counts : character-based and word-based n-gram counts from large-scale corpora.
- Corpus of Chinese synonyms : from multiple reputable sources with over 70k base examples.
- Corpus of Chinese fixed phrases and idioms : rich dictionary-like accounts for 30310 instances.