Information Retrieval
These are notes from my readings related to my work on a search engine (documents search, not web)
Last updated
Some terminology encountered in NLP research related to search:
TREC 2020 - Deep Learning Track
this is a must-check for latest trends
it has its own dataset with its own particularities (make sure to read about the dataset for the particular year)
they publish an overview yearly
the track has two tasks: document retrieval and passage retrieval
TREC 2020 overview [1]:
rankers with BERT-style pretraining outperform other rankers in the large data regime
same as in the 2019 Deep Learning Track, not yet seeing a strong advantage of “fullrank” over “rerank”; but full ranking is believed to eventually outperform reranking
trends, compared to previous year:
fewer submissions for more traditional neural network models
more for language models (BERT, XLNet [2])
MS MARCO
dataset released by Microsoft (Bing search queries)
same two tasks, but called: document ranking and passage ranking
submissions can be made anytime (not a yearly event like TREC)
can check the ranking of submitted models on the official website
interestingly, for passage ranking the first 10 results are currently full ranking models (not reranking)
some groups make their code available
[1] N. Craswell, B. Mitra, E. Yilmaz, and D. Campos. Overview of the TREC 2020 Deep Learning Track. 2021. https://arxiv.org/pdf/2102.07662.pdf
[2] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237, 2019.
How I think the search UI should look.
Also worth following:
CIRCLE (Joint Conference of the Information Retrieval Communities in Europe) - started in 2020
I found the proceedings from SIGIR and ACL particularly interesting
SIGIR:
publishes a comprehensive booklet with all papers, short papers and tutorials
some of the tutorials are in the form: overview of the recent past + state of the art on a specific topic
some tutorials get published as books afterwards (some freely available on research publications' websites)
found SIGIR to be a good source of what's new / the trend in information retrieval (IR) year after year
ACL:
slides and video tutorials available on the website
Who is publishing valuable resources
Scientific articles:
Research labs
Large companies: found several relevant articles from LinkedIn and eBay
Blog articles:
Large companies:
OLX has some interesting posts on IR on Medium
Etsy Medium post on building an autosuggest corpus (actually by an ex-Etsy author, but it draws on lessons learned at Etsy)
Small companies:
For example, Building a medical search engine by Posos: https://medium.com/posos-tech/building-a-medical-search-engine-step-1-medical-word-embeddings-ec9b13e1870d
Where to start a search:
SemanticScholar, search for a review/book: https://www.semanticscholar.org/
Proceedings of SIGIR (check the yearly booklet with the title and authors of everything that was presented)
most of the articles submitted to SIGIR are freely available on arXiv.org
the most informative resource on SOTA I found was from the proceedings of SIGIR 2020
Proceedings of TREC (see slide on Competitions and trends)
Motivation behind autocomplete
Autocomplete is extra work. We should have strong reasons to add it.
1. User Experience
familiarity (we expect this from a search engine)
2. To speed up the process of text input
less typing
3. Helping the user write better queries before firing the search
mismatch between words used in the query and the user's intent
exploration phase in search: the user may not know what exactly they're looking for
Sources of information for autocomplete
Query logs (see the sketch after this list)
past searches of this user
past searches of all users
considerations
user privacy
mistakes (spelling errors)
Documents
extract statistical patterns from the corpus of documents
considerations:
users and the makers of the corpus may not use the same words to refer to the same thing
Information about the user
location
language
A matter of choice
no right/wrong answer
choose based on needs and available data
if both options are possible (query logs and info from documents), compare their outputs or run live A/B tests
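A minimal sketch of a query-log-based suggester (a hedged illustration, not a production design): past queries are counted and the most frequent ones matching the typed prefix are suggested. The toy log and the suggest function are made up for the example.

```python
from collections import Counter

# Toy query log; in practice this would come from (privacy-filtered) search logs.
query_log = [
    "bert for search",
    "bert for search",
    "bm25 vs tf-idf",
    "bert fine tuning",
]

query_counts = Counter(q.strip().lower() for q in query_log)

def suggest(prefix: str, k: int = 5) -> list[str]:
    """Return the k most frequent past queries that start with the typed prefix."""
    prefix = prefix.strip().lower()
    matches = [(q, c) for q, c in query_counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda qc: qc[1], reverse=True)
    return [q for q, _ in matches[:k]]

print(suggest("bert"))  # ['bert for search', 'bert fine tuning']
```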
How to build the autocomplete corpus
Three common options:
1. Ad-Hoc Heuristics
hand-tune some rules for selecting / creating relevant phrases from the text corpus
use frequency-based metrics at the n-gram level (TF-IDF, for example)
apply further restrictions, like keeping only specific parts of speech (e.g. nouns)
use simple statistical methods to detect & remove chance associations between n-grams (nltk package); see the sketch after this list
2. Supervised Learning
label some data
train a model (binary classification, 1 / 0) to extract relevant phrases
3. Unsupervised Learning
for example, TextRank for keyword and sentence extraction [1]
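A minimal sketch of the ad-hoc heuristic route (option 1, referenced above), using NLTK's collocation finder to keep bigrams that are unlikely to be chance associations. The toy corpus, whitespace tokenisation and frequency threshold are assumptions; part-of-speech filtering would be a further step.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus; in practice this would be the full text of the document collection.
corpus = (
    "machine learning for information retrieval "
    "information retrieval with neural networks "
    "neural networks for passage ranking "
    "passage ranking and information retrieval"
)
tokens = corpus.lower().split()  # naive whitespace tokenisation, for the sketch only

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop bigrams seen fewer than 2 times

# The likelihood-ratio measure downweights pairs that co-occur only by chance.
candidates = finder.nbest(measures.likelihood_ratio, 10)
print([" ".join(bigram) for bigram in candidates])
# e.g. ['information retrieval', 'neural networks', 'passage ranking']
```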
How to evaluate an autocomplete system
Acceptance rate
Every new character typed is a rejection
The rank of the accepted suggestion
what position it occupied in the list of suggestions
Average keystrokes
how many characters the user had to type before they accepted a suggestion
Think of a way to detect "fake positives":
the user may have accepted a suggestion, only to realise it did not lead to the expected results, and then resumed the search / typing (see the metrics sketch below)
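A sketch of how the three metrics above could be computed from logged autocomplete sessions; the log format and field names are assumptions, and "fake positive" detection would need extra signals (e.g. whether the user searched again shortly after accepting).

```python
# Hypothetical autocomplete session logs; the field names are made up for this sketch.
# "accepted_rank" is 1-based, or None when the user typed the full query themselves.
sessions = [
    {"keystrokes_before_accept": 4, "accepted_rank": 1},
    {"keystrokes_before_accept": 7, "accepted_rank": 3},
    {"keystrokes_before_accept": 12, "accepted_rank": None},  # no suggestion accepted
]

accepted = [s for s in sessions if s["accepted_rank"] is not None]

acceptance_rate = len(accepted) / len(sessions)
mean_rank = sum(s["accepted_rank"] for s in accepted) / len(accepted)
avg_keystrokes = sum(s["keystrokes_before_accept"] for s in accepted) / len(accepted)

print(f"acceptance rate: {acceptance_rate:.2f}")     # 0.67
print(f"mean accepted rank: {mean_rank:.2f}")        # 2.00
print(f"average keystrokes: {avg_keystrokes:.2f}")   # 5.50
```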
Trends (historical order, newest trends at the bottom)
Approach 1: "Exact match"
very successful example: Okapi BM25 - a bag-of-words retrieval function - ranks documents based on the query terms appearing in each document (see the sketch after this list)
Elasticsearch used TF-IDF until 2016, when it switched to BM25
Enhancement: method to enrich query and / or document representations
Objection: lexical match only; no semantic matching (e.g. synonymy, paraphrase, term variation, and different expressions of similar intents)
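To make the "exact match" idea concrete, here is a minimal in-memory sketch of one common variant of the Okapi BM25 scoring function (toy corpus, whitespace tokenisation, typical parameter values k1 = 1.5 and b = 0.75); a production system would use an inverted index such as Elasticsearch instead.

```python
import math
from collections import Counter

# Toy corpus; each document is a list of tokens.
docs = [
    "the quick brown fox".split(),
    "quick brown foxes leap over lazy dogs".split(),
    "lazy dogs sleep all day".split(),
]

k1, b = 1.5, 0.75                              # commonly used parameter values
N = len(docs)
avgdl = sum(len(d) for d in docs) / N          # average document length
df = Counter(t for d in docs for t in set(d))  # document frequency per term

def bm25(query: list[str], doc: list[str]) -> float:
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue  # exact match only: unseen terms contribute nothing
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

query = "quick dogs".split()
ranked = sorted(docs, key=lambda d: bm25(query, d), reverse=True)
print([" ".join(d) for d in ranked])  # the second document (matches both terms) ranks first
```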
Approach 2: "Learning to rank"
manually create features for documents (e.g. BM25 scores between the query and various document fields, frequency of co-occurrence of a term pair, etc.)
train ML models (supervised learning) on these hand-crafted features (see the sketch after this list)
A real-world search engine could have hundreds of features
popular from roughly the late 1990s to the 2010s, largely thanks to the success of gradient-boosted decision trees; lost popularity with the hype around deep learning
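A sketch of the feature-based setup, assuming the LightGBM library and its LambdaRank objective (one of several learning-to-rank formulations); the handcrafted features, relevance labels and group sizes below are synthetic and purely illustrative.

```python
import numpy as np
import lightgbm as lgb

# Synthetic handcrafted features per (query, document) pair:
# [BM25 score, fraction of query terms in the title, document length]
X = np.array([
    [12.1, 0.8, 250], [7.4, 0.2, 900], [3.3, 0.0, 120],   # query 1: 3 candidates
    [15.0, 1.0, 300], [2.1, 0.1, 800],                    # query 2: 2 candidates
])
y = np.array([2, 1, 0, 2, 0])  # graded relevance labels for each pair
groups = [3, 2]                # number of candidate documents per query

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=groups)

# Rank unseen candidates for a new query by predicted score (higher = better).
candidates = np.array([[9.0, 0.5, 400], [1.0, 0.0, 700]])
order = np.argsort(-ranker.predict(candidates))
print(candidates[order])
```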
Approach 3: "Deep Learning", pre-BERT
2 main advantages:
no more exact match (instead, continuous vector representations)
no more need for laborious hand crafted features
e.g. popular: word2vec (Mikolov et al., 2013) and paragraph vectors (Le and Mikolov, 2014)
wide range of neural architectures were explored: CNN, RNN, LSTM
deep learning gained popularity due to its success in computer vision
studies found traditional neural ranking models ≈ "exact match" models in the context of limited training data (most real-world situations) [2][3]
Approach 4: "Deep Learning", BERT-based
TREC 2019 - Deep Learning Track - BERT models performed better (different teams, different tasks) [1]
[1] Pretrained Transformers for Text Ranking: BERT and Beyond, Jimmy J. Lin, Rodrigo Nogueira, Andrew Yates, NAACL, 2021. https://arxiv.org/abs/2010.06467
[2] W. Yang, K. Lu, P. Yang, and J. Lin. Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 1129–1132, Paris, France, 2019.
[3] A. Yates, S. Arora, X. Zhang, W. Yang, K. M. Jose, and J. Lin. Capreolus: A toolkit for end-to-end neural ad hoc retrieval. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 861–864, 2020.
BERT
first used in IR in 2019: to rank passages from web pages with respect to users' natural language search queries [1], on the MS MARCO passage retrieval test collection [2]
important advantage: many studies showed that with pre-trained transformer models, large amounts of relevance judgments were not necessary to build effective models for text ranking.
two common approaches for using BERT models in search (using the terminology from [3]):
multi-stage ranking architectures
learning dense representations
Why switch to a BERT-based model in the first place:
possibly less training data needed: we use BERT models pre-trained on huge amounts of data and then fine-tune them for our task; this approach was found to need considerably less task-specific training data
best results in public information retrieval competitions come from BERT-based models
Using a pre-trained model:
fine-tuning on our domain-specific data
or using it directly (research found worse results with this approach)
[1] J. Lin, R. Nogueira, A. Yates. Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2010.06467v1 [cs.IR] 13 Oct 2020 https://arxiv.org/pdf/2010.06467.pdf
[2] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang. MS MARCO: A Human Generated MAchine Reading Comprehension Dataset. https://arxiv.org/abs/1611.09268
[3] Pretrained Transformers for Text Ranking: BERT and Beyond, Jimmy J. Lin, Rodrigo Nogueira, Andrew Yates, NAACL, 2021. https://arxiv.org/abs/2010.06467
Multi-stage ranking / cross-encoder
this architecture has been used since the first applications of BERT to search ranking (2019)
receives both the query and the document as inputs: pair (q,d)
large number of parameters to learn
at prediction time, it will be applied to all candidate documents
large corpus of documents => too slow to predict for all documents => a list of candidate documents is assembled with a fast, simple algorithm (e.g. BM25) and the model predicts only for these candidates (as sketched below)
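A minimal reranking sketch (as referenced above), assuming the sentence-transformers library and a publicly available MS MARCO cross-encoder checkpoint; the query and candidate passages are made up, and in practice the candidates would come from a fast first-stage retriever such as BM25.

```python
from sentence_transformers import CrossEncoder

query = "side effects of ibuprofen"
candidates = [  # would normally come from a fast first-stage retriever
    "Ibuprofen can cause stomach pain, nausea and dizziness.",
    "Paris is the capital of France.",
    "Common ibuprofen side effects include heartburn and headache.",
]

# The cross-encoder scores each (query, passage) pair jointly.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, passage) for passage in candidates])

# Rerank candidates by score, highest first.
for score, passage in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True):
    print(f"{score:.3f}  {passage}")
```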
Learned Dense Representations / bi-encoder
this architecture had weaker results at first, but it has been improving gradually, and research has been focusing strongly on it since 2020
the problem can be formulated as:
classification (predict if query q is relevant / irrelevant for document d)
similarity (compute an embedding for q and perform kNN search in the document vector space); see the sketch below
advantages over cross-encoder:
encodings for the corpus of documents can be computed offline and stored => at prediction time, the model is only applied to the query string q => faster prediction
no need for precomputed list of candidate documents
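A minimal dense-retrieval sketch (the similarity formulation above), assuming the sentence-transformers library and one of its public MS MARCO bi-encoder checkpoints; here the kNN search is a brute-force cosine-similarity search, whereas a real system would precompute the document embeddings and store them in a vector index (e.g. FAISS).

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: documents are encoded once, offline; at query time only the query is encoded.
model = SentenceTransformer("msmarco-distilbert-base-v4")  # public MS MARCO bi-encoder

documents = [
    "Ibuprofen can cause stomach pain, nausea and dizziness.",
    "Paris is the capital of France.",
    "Common ibuprofen side effects include heartburn and headache.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)   # precompute & store

query_embedding = model.encode("side effects of ibuprofen", convert_to_tensor=True)

# k-nearest-neighbour search in the document embedding space.
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {documents[hit['corpus_id']]}")
```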
Multi-stage ranking / cross-encoder OR Learned Dense Representations / bi-encoder
The answer depends on:
the process of building the candidate list (speed / quality of results)
the inference time of the multi-stage architecture (expected to be slower than the learned dense representations architecture)
Preferred approach:
start with multi-stage (it has been researched for longer, so more documentation resources are available)
if multi-stage is suboptimal (depending on what we want to optimise), proceed to learned dense representations (more recent, possibly less documentation, could be more difficult to obtain a satisfactory model)
Inference speed:
there are methods to increase speed (if it turns out to be too low):
distillation
weights quantisation (see the sketch after this list)
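A minimal sketch of post-training dynamic quantisation with PyTorch (the quantisation option above); the checkpoint name is just an example, and distillation would instead train a smaller student model to mimic the larger one.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Dynamic quantisation stores/evaluates the linear layers in int8, which usually
# shrinks the model and speeds up CPU inference at a (often small) accuracy cost.
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("side effects of ibuprofen",
                   "Ibuprofen can cause stomach pain.",
                   return_tensors="pt")
with torch.no_grad():
    print(quantized(**inputs).logits)  # relevance score from the quantised model
```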
Text length:
BERT models have a maximum input sequence length of 512 tokens (a token is not a word; a word can be split into several tokens when text is converted to the input form expected by a BERT-based model)
workarounds:
input truncation is a popular method
or splitting a document into smaller chunks (see the sketch below)
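A sketch of the chunking workaround mentioned above, using a Hugging Face tokenizer's built-in overflow handling; the checkpoint, chunk size and overlap (stride) are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_document = "some very long document text " * 500  # far more than 512 tokens

# Split into overlapping 512-token chunks; each chunk can then be scored separately
# and the per-chunk scores aggregated (e.g. max) into a document score.
encoded = tokenizer(
    long_document,
    max_length=512,
    truncation=True,
    stride=64,                       # 64-token overlap between consecutive chunks
    return_overflowing_tokens=True,
)

print(f"number of chunks: {len(encoded['input_ids'])}")
print(f"tokens in first chunk: {len(encoded['input_ids'][0])}")
```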
Intent detection:
BERT has also been used for more accurate intent detection: LinkedIn used BERT to predict which class of document a user search is targeting (user profiles / user feeds / job posts, etc.) [1]
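LinkedIn fine-tunes BERT on its own query logs, which is not reproducible here; as a rough stand-in, the sketch below frames intent detection as text classification using an off-the-shelf zero-shot model and a made-up label set, just to illustrate the task setup.

```python
from transformers import pipeline

# Zero-shot classification as a stand-in for a fine-tuned intent classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["user profile", "user feed", "job post"]  # illustrative intent classes
for query in ["python developer berlin", "posts about remote work"]:
    result = classifier(query, candidate_labels=labels)
    print(query, "->", result["labels"][0], f"({result['scores'][0]:.2f})")
```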