[IMPORTANT]: Actively working on the blog.
NOTE: Although the idea of doing something similar to “Question as a vector” had been on my mind for quite some time, [1] definitely fueled the fire to write it up and present some empirical evidence. So kudos to [1]; it is a great read.
Background
Growing up in an era where deep learning was becoming a tour de force, I found vectors, embeddings and representation learning intriguing, yet they were treated as, dare I say it, a ‘fad’. Introduction to Statistical Learning was the foundation that gave me an inside view of learning algorithms, the learning process, and the broader practice of interpreting outcomes and evaluations. Deep down, universal function approximation and neural networks seemed fascinating to everyone because of their ability to learn over any dataset, but it seemed difficult to make the transition.
So, whenever I was advised to build machine learning models for a given dataset $D$, the almost automatic move was to pull up the scikit-learn module, pick one of its classifier or regressor objects, and “fit” the model over $D$. It rarely occurred to me to use the sklearn.neural_network.MLPClassifier object, because it felt far more intuitive to take hand-crafted features and train a classifier with a supervised learning algorithm.
Neural networks were more or less the “if everything else fails, pick me please” category of learning algorithms for most people working with tabular datasets. That said, as the representation learning literature grew, it became the obvious choice for text-based modalities: take the raw “text” and map it to the same output space as before, without hand-crafting features.
So, when I say Question as a vector is a fringe idea, what I mean is I am retrofitting the incredible reflection and reasoning capabilities of large language models to traditional learning algorithms.
as always, ridiculous thoughts begin at night :)
Core idea
Consider a dataset $D$ with inputs $X$ and a target variable $Y$, where $X \in \mathbb{R}^n$ is a group of hand-crafted features built by asking analytical or logical questions about the raw inputs. The proposal here is to use an LLM, $f_{LLM}(X)$, to produce a vector $X_{LLM}$, and then use $X_{LLM}$ as the composite feature vector to train a learning model.
Ideally, it would be easier to evaluate the effectiveness of this approach if $D$ is a textual dataset where the initial $X$ is hand crafted and $X_{LLM}$ is generated by querying any LLM on the raw text of $X_i$.
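To make the idea concrete, here is a minimal sketch of the pipeline. Everything specific in it is an assumption for illustration: the three questions, the toy texts and labels, and the `ask_llm_stub` function, which is only a keyword check standing in for $f_{LLM}$ so that the example runs offline. A real version would send each question together with the raw text to an LLM and parse the answer into a number. The downstream model is a plain scikit-learn `LogisticRegression`, chosen here only because it is a familiar traditional learner.

```python
from typing import Callable

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical questions, one per dimension of X_LLM (not from the post).
QUESTIONS = [
    "Does the article link to source code?",
    "Does the article mention a public dataset?",
    "Does the article report hyperparameters?",
]

# Keyword lists used ONLY by the offline stub below.
_STUB_KEYWORDS = {
    QUESTIONS[0]: ("github", "source code"),
    QUESTIONS[1]: ("public dataset", "publicly available"),
    QUESTIONS[2]: ("hyperparameter", "learning rate"),
}


def ask_llm_stub(question: str, text: str) -> float:
    """Stand-in for f_LLM: answers 1.0 (yes) or 0.0 (no) via keyword match.

    Assumption: a real implementation would query an LLM with the
    question and the raw text, then map its yes/no answer to 1.0/0.0.
    """
    lowered = text.lower()
    return float(any(k in lowered for k in _STUB_KEYWORDS[question]))


def question_vector(
    text: str, ask: Callable[[str, str], float] = ask_llm_stub
) -> np.ndarray:
    """Build one row of X_LLM: one answer per question."""
    return np.array([ask(q, text) for q in QUESTIONS])


# Toy raw texts X_i and made-up binary targets Y.
texts = [
    "We release source code on GitHub and use the public dataset MNIST. "
    "Hyperparameters are listed in the appendix.",
    "Code is on GitHub; the learning rate is given in Table 2.",
    "We describe the method at a high level.",
    "Results are reported on a proprietary corpus.",
]
labels = [1, 1, 0, 0]

# Stack the per-text answer vectors into the feature matrix X_LLM.
X_llm = np.vstack([question_vector(t) for t in texts])

# Train a traditional learner on the question-derived features.
clf = LogisticRegression().fit(X_llm, labels)
predictions = clf.predict(X_llm)
```

Swapping `ask_llm_stub` for a function that calls an actual model is the only change needed to move from this toy to the real setting; the rest of the pipeline stays a standard fit over a feature matrix.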
Data
Most of my work revolves around the Science of Science, and this idea was conceived while I was building a representation learning model to predict the reproducibility of scientific articles [2]. So, I use the ACMBadgesDataset mentioned in [2] and available from [3], like so:
TBA