Technical Skills

  • ML/DL | CNN, RNN, LSTM, Transformer, Seq2Seq, Self-Supervised Pre-Training, Multitask Learning, Domain Adaptation, GBDT, SVM, Ensemble Methods.
  • DL for NLP | Language Modeling, Embedding, Text Classification, Question Answering, Sequence Labeling, Dialogue Systems.
  • DL for Multimodality | Visual-Language Pre-Training, Document Understanding, Screen Understanding, Vision Transformer, Visual Question Answering.
  • Optimization | Linear/Nonlinear/Dynamic Programming, Convex Optimization.
  • Signals & Systems | LTI Signal Processing, Linear/Nonlinear Dynamical Systems, Digital Signal Processing, Image Processing.
  • Others | Information Retrieval, Image Processing, Signal Processing, Numerical Linear Algebra, Numerical Analysis, Parallel and Distributed Computing.

Software Skills

  • Programming Languages | Python, C++, C#, Java, JavaScript, Matlab, SQL
  • DL | PyTorch, TensorFlow, JAX, Pax, Hugging Face.
  • ML | scikit-learn, xgboost, NumPy, SciPy.
  • NLP | Stanford CoreNLP, SpaCy, NLTK, genism, Solr/Lucene
  • CV | OpenCV, Pillow
  • Big Data | Hadoop, Beam, MapReduce, Hive, Presto, etc.
  • Parallel Computing | OpenMP, MPI.
  • Web Frameworks | Django, Flask, Angular.
  • Numerical Computation | Eigen, BLAS, LAPACK.
  • Virtualization and Deployment | Docker, RHEL, Ubuntu, AWS, Azure, Google Cloud.


ScreenQA: Question-Answer Pairs Over Mobile App Screenshots

We present a new task and dataset, ScreenQA, for screen content understanding via question answering. The existing screen datasets are focused either on structure and component-level understanding, or on a much higher-level composite task such as navigation and task completion. We attempt to bridge the gap between these two by annotating 80,000+ question-answer pairs over the RICO dataset in hope to benchmark the screen reading comprehension capacity.

The challenge of this task lies in three aspects: 1) Screenshots are a natural mixture of text, symbols, and images in a layout. The model is expected to understand interactions between these modalities. 2) UI elements are not precisely given. Although the structural information such view hierarchies and DOM trees, may be available, they typically exhibit noticeable rendering issues (right of the image). 3) The complete answer may come from multiple parts from the screen (left of the image). The task requires the model to include all pieces to get a full score.


Multimodal LLM for Document and Screen Understanding

Scanned document images and screenshots share a common characteristic: Both of these image pixels contain text phrases arranged in a non-sequential manner. To build a foundation model that works for various downstream tasks, we need to inject multiple modalities, such as text phrases, coordinates of the phrases, the corresponding image crops, and the overall image pixels, in order to provide comprehensive understanding and representation learning.

To this aim, we integrate various state-of-the-art architectures: From language modeling, we investigate Encoder-Decoder/Decoder-Only LLMs. From document understanding, Attention Bias develops the awareness of 2D layout alignment. From image encoding, ResNet, DETR, Vision Transformer (ViT), etc., are explored. From modality fusion, Perceiver, Flamingo, Prompt Tuning, Cross Attention are studied. The goal is simple: to provide the best model that works well for all types of Document and Screen Understanding tasks.

Multi-Span Question Answering

The state-of-the-art closed-domain question answering problem like SQuAD expects a single-span answer from the closed-domain context. However, in a more natural setting, a question typically comes first. Then the relevant document is retrieved by a search engine to pair up the question as the context. In such a case, the answer usually scatters all over the context as multiple phrases. An example can be seen from the image above. This is a step forward toward a semi-open-domain question answering system.

This project utilized BiLSTM with Co-Attention and Self-Attention mechanisms, implemented in PyTorch.

Deep Learning for Ranking

Document relevance ranking, recommendation, and potentially click-through rate prediction can be all accomplished through NLP deep learning. The canonical neural network design is to start with a word- and/or character-level embedding, followed by layers of RNN and Attention mechanisms that fuse multiple input signals together (typically query and document), and end with a metric of the ranking loss.

This approach is significant more accurate than traditional information retrieval methods, such as TFIDF or BM25, at a cost of much higher computational complexity. How to strive for a balance between these two types of methods is the key for a practical design.

Text Classification for Transaction Data

The text in credit card transaction descriptions are largely ungrammatical, with scrambled word ordering and inconsistent space delimiters (e.g., whitespace, asterisk, and/or a mixture of both). This poses challenges in proper tokenization. Examples can be seen above.

In this case, RNN or Transformers may be of little help but just computationally expensive. A better trade-off can be achieved using a traditional NLP method, such as bag-of-words featurization on letter n-grams, followed by classic multi-class classification methods. This method is patented here:


GitHub Repo Recommender System

GitHub is great except for navigating yourself on the site. Back in 2014 when this project was conducted, the GitHub repo recommendation functionality left a lot to be desired (it is now improved very much compared to then). In light of this, I designed and implemented a GitHub recommender system based on README keyword similarities. The relevant repos are precomputed and clustered by the affinity propagation method to enable real-time retrieval upon users' request.

The overall recommender system is implemented in Python and was served as a web application.

Slides. GitHub.

Point Cloud Object Recognition

The point cloud image representation has been growing its popularity due to the recent increases of LIDAR signal applications from the autonomous driving and robotics industry. To recognize the objects from points directly (without projecting points onto pre-defined grids), one recent breakthrough is the work "PointNet," which relies on symmetric functions whose computation is independent of data orders and all points can then be treated as an unordered set. However, the original work of PointNet does not utilize the proximity information between points and seems to have room for improvement.

I proposed an augmentation method "super points," a representation of point clusters, to help the neural net recognize point proximity information. This improves the prediction accuracy against the ModelNet40 benchmark.


Automated Modeling of Nonlinear Dynamical Systems

This project is to design a black-box learning algorithm for nonlinear dynamical system modeling, with the aim to enable hierarchical, bottom-up modeling strategy. Prior to this work, the existing modeling methods for nonlinear dynamical systems target only the end-to-end modeling scenario, which implies that any small changes in one component of the system requires a whole model retraining. A more practical approach is to build a model of each component of the system one at a time and then construct the whole system by composing component models. However, composing models implicitly creates feedback loops and almost always unstabilizes the overall system model.

I developed a convexified theoretical criterion such that the composed systems are numerically stable and implemented a convex optimization program. This method has been successfully applied to nonlinear circuits, human arterial networks, and finance time-series stock price prediction.

Thesis. GitHub.

CAPLET: Parallelized Capacitance Extraction Toolkit

CAPLET is an open-source capacitance extraction toolkit that computes field-solver accurate solutions using our developed ultra parallel-efficient algorithm, which is built on top of empirical instantiable basis functions for the boundary element method.

Our freely released software covers the complete extraction flow from GDS2 layout files to capacitance matrices. This is, to the best of our knowledge, the first open-source extraction tool that incorporates geometry algorithms, basis function generation, 3D layout visualization, and parallelized field solver extraction altogether. CAPLET provides both intuitive GUI and command line interfaces. The underlying algorithm is specialized for VLSI interconnects in Manhattan geometries inside a uniform dielectric material.

Website. GitHub.