I am a recent MA graduate in Applied Linguistics from the University of Saskatchewan. My main research interests are natural language processing (NLP), corpus linguistics, and computational social science. I combine computational and linguistic approaches to study the properties of languages, along with the linguistic and social meanings that can be inferred from actual language use in large-scale corpora. I am also highly interested in how neural networks encode linguistic knowledge and "learn" human languages.
Inspired by Andrew Ng, I believe in the promise of data-centric AI (as opposed to model-centric AI). In my research, I aim to leverage both linguistic domain knowledge and learning algorithms to train more robust NLP models with data that is either much smaller in size or can be obtained with much less manual effort. Only in this way can we extend computational linguistics to understudied text and language domains and bring more practical value to the real world.
I was born and raised in Fuqing, a small city in southeastern China. From 2015 to 2019, I studied at Hunan University for my bachelor's degree. Majoring in Chinese Language and Literature, I discovered my passion for linguistics in my first year of university and went on to undertake a two-year funded project on Chinese grammar and a three-month psycholinguistics internship in Canada. Besides literature- and linguistics-related research, I also spent time doing research in law, history, social science, and a bit of philosophy.
From 2019 to 2021, I studied at the University of Saskatchewan, majoring in Applied Linguistics. I taught myself programming and computational linguistics and focused on quantitative, data-driven analysis of language use in transcribed linguistic corpora. Starting from purely rule-based programming, exemplified by extensive regular expressions for syntactic parsing, text extraction, and corpus annotation, I steadily became fluent in building NLP models with statistical machine learning and deep learning. As I am new to this area, I welcome any like-minded people to reach out!
My full CV.
Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chinese Question Matching
Zhengxiang Wang, arXiv:2111.14709v1 [cs.CL], 2021
About: As an effort toward data-centric NLP, I explored the role of probabilistic linguistic knowledge in data augmentation for a binary Chinese question matching classification task. You can check out the source code, data, experimental results, and recent updates in this repository.
A macroscopic re-examination of language and gender: a corpus-based case study in the university classroom setting
MA thesis, University of Saskatchewan, 2021
About: The thesis compared the use of 87 syntactic and lexical linguistic features by male and female university instructors across four academic disciplines in a data-driven manner, taking the complexity of language use into account. Linguistic Feature Extractor automates the feature extraction.
Grammar as science: Rethinking the construction of Modern Chinese Grammar
A funded research project, Hunan University, 2017-2019
About: 11 essays/papers (in Chinese, 99 pages), with one published, containing both synchronic and diachronic studies of several Chinese grammatical phenomena. Topics include parts-of-speech classification, lexical reduplication, grammaticalization, negators, and discourse particles.
- Text matching explained & Text classification explained: building and training deep learning models for text (matching) classification tasks from scratch using paddle, PyTorch, and TensorFlow.
- Notes for Stanford CS224N: Natural Language Processing with Deep Learning.
- Hands-on gradient-derivation tutorials for common machine learning loss functions.
- Deep-learning-based Natural Language Processing using paddlenlp: covering a wide range of essential NLP tasks (both classification and non-classification) for industry, along with state-of-the-art (SOTA) practices.
- Word embedding resources, applications, visualization, and training (word2vec in Python).
- Text augmentation techniques: from random text-editing perturbations and back translation to model-based transformations. Also see: data augmentation programs (plus an n-gram language model).
- Historical English Language Processing Toolkit: an efficient toolkit and general framework for Early Modern and Modern English language processing (multi-label annotation) in XML.
- Linguistic Feature Extractor: a corpus-linguistic tool to extract and search for linguistic features (with 95 built-in features), generating both feature statistics and the extracted instances.
- Unfilled Pause Classifier: a rule-based syntactic parser classifying unfilled pauses in the British Academic Spoken English (BASE) corpus.
- Google Scholar Analyzer: automatically aggregating the academic profiles of researchers on Google Scholar.
- YouTube Info Collector: an interface to scrape information (video titles, post dates, view counts, like counts, comments, etc.) from YouTube videos based on queries, video links, or channel links.
- Gender Predictor: predicting the gender of Chinese names with over 93% (up to 97%) test-set accuracy using Naive Bayes and multi-class logistic regression models (tutorials included).
- CCNC: a Comprehensive Chinese Name Corpus (3.65M unique name samples).
- Chinese Ngrams Counts: character-based and word-based n-gram counts from large-scale corpora.
- Corpus of Chinese synonyms: compiled from multiple reputable sources, with over 70k base examples.
- Corpus of Chinese fixed phrases and idioms: rich dictionary-like entries for 30,310 instances.
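The random text-editing perturbations mentioned in the text augmentation entry above can be illustrated with a minimal, EDA-style (Easy Data Augmentation) sketch. The function names and the sample sentence here are illustrative only and are not taken from the actual repositories:

```python
import random

def random_swap(tokens, n=1):
    """Randomly swap two tokens n times (an EDA-style perturbation)."""
    tokens = tokens[:]  # copy so the input list is left untouched
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Delete each token independently with probability p; keep at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

if __name__ == "__main__":
    random.seed(42)
    sent = "the quick brown fox jumps over the lazy dog".split()
    print(" ".join(random_swap(sent, n=2)))
    print(" ".join(random_deletion(sent, p=0.2)))
```

Each perturbed sentence keeps (most of) the original label-relevant content while varying surface form, which is why such cheap perturbations can act as label-preserving training data for classification tasks.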