Information Retrieval
These are notes from my readings related to my work on a search engine (documents search, not web)
Last updated
Some terminology encountered in NLP research related to search:
TREC 2020 - Deep Learning Track
this is a must-check for latest trends
it has its own dataset with its own particularities (make sure to read about the dataset for the particular year)
they publish an overview yearly
the track has two tasks: document retrieval and passage retrieval
TREC 2020 overview [1]:
rankers with BERT-style pretraining outperform other rankers in the large data regime
same as in the 2019 Deep Learning Track, not yet seeing a strong advantage of “fullrank” over “rerank”; but full ranking is believed to eventually outperform reranking
trends, compared to previous year:
fewer submissions for more traditional neural network models
more for language models (BERT, XLNet [2])
MS MARCO
dataset released by Microsoft (Bing search queries)
same two tasks, but called: document ranking and passage ranking
submissions can be made anytime (not a yearly event like TREC)
can check the ranking of submitted models on the official website
interestingly, for passage ranking the first 10 results are currently full ranking models (not reranking)
some groups make their code available
[1] N. Craswell, B. Mitra, E. Yilmaz, and D. Campos. Overview of the TREC 2020 Deep Learning Track. 2021. https://arxiv.org/pdf/2102.07662.pdf
[2] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237, 2019.
How I think the search UI should look.
Also worth following:
CIRCLE (Joint Conference of the Information Retrieval Communities in Europe) - started in 2020
I found the proceedings from SIGIR and ACL particularly interesting
SIGIR:
publishes a comprehensive booklet with all papers, short papers and tutorials
some of the tutorials are in the form: overview of the recent past + state of the art on a specific topic
some tutorials get published as books afterwards (some freely available on research publications' websites)
found SIGIR to be a good source of what's new / the trend in information retrieval (IR) year after year
ACL:
slides and video tutorials available on the website
Who is publishing valuable resources
Scientific articles:
Research labs
Large companies: found several relevant articles from LinkedIn and eBay
Blog articles:
Large companies:
OLX has some interesting posts on IR on Medium
Etsy Medium post on building an autosuggest corpus (actually by an ex-Etsy author, but it draws on lessons learned at Etsy)
Small companies:
For example, Building a medical search engine by Posos: https://medium.com/posos-tech/building-a-medical-search-engine-step-1-medical-word-embeddings-ec9b13e1870d
Where to start a search:
SemanticScholar, search for a review/book: https://www.semanticscholar.org/
Proceedings of SIGIR (check the yearly booklet with the title and authors of everything that was presented)
most of the articles submitted to SIGIR are freely available on arXiv.org
the most informative resource on SOTA I found was from the proceedings of SIGIR 2020
Proceedings of TREC (see slide on Competitions and trends)
Motivation behind autocomplete
Autocomplete is extra work. We should have strong reasons to add it.
1. User Experience
familiarity (we expect this from a search engine)
2. To speed up the process of text input
less typing
3. Helping the user write better queries before firing the search
mismatch between words used in the query and the user's intent
exploration phase in search: the user may not know what exactly they're looking for
Sources of information for autocomplete
Query logs (see the sketch after this list)
past searches of this user
past searches of all users
considerations
user privacy
mistakes (spelling errors)
Documents
extract statistical patterns from the corpus of documents
considerations:
users and the makers of the corpus may not use the same words to refer to the same thing
Information about the user
location
language
A matter of choice
no right/wrong answer
choose based on needs and available data
if both options are possible (query logs and info from documents), compare their outputs or run live A/B tests
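A minimal sketch of a query-log-based suggester (a hedged illustration, not a production design): past queries are counted and the most frequent ones matching the typed prefix are suggested. The toy log and the suggest function are made up for the example.

```python
from collections import Counter

# Toy query log; in practice this would come from (privacy-filtered) search logs.
query_log = [
    "bert for search",
    "bert for search",
    "bm25 vs tf-idf",
    "bert fine tuning",
]

query_counts = Counter(q.strip().lower() for q in query_log)

def suggest(prefix: str, k: int = 5) -> list[str]:
    """Return the k most frequent past queries that start with the typed prefix."""
    prefix = prefix.strip().lower()
    matches = [(q, c) for q, c in query_counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda qc: qc[1], reverse=True)
    return [q for q, _ in matches[:k]]

print(suggest("bert"))  # ['bert for search', 'bert fine tuning']
```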
How to build the autocomplete corpus
Three common options:
1. Ad-Hoc Heuristics
hand-tune some rules for selecting / creating relevant phrases from the text corpus
use frequency-based metrics at the n-gram level (TF-IDF, for example)
apply further restrictions, like keeping only specific parts of speech (e.g. nouns)
use simple statistical methods to detect & remove chance associations between n-grams (nltk package); see the sketch after this list
2. Supervised Learning
label some data
train a model (binary classification, 1 / 0) to extract relevant phrases
3. Unsupervised Learning
for example, TextRank for keyword and sentence extraction [1]
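A minimal sketch of the ad-hoc heuristic route (option 1, referenced above), using NLTK's collocation finder to keep bigrams that are unlikely to be chance associations. The toy corpus, whitespace tokenisation and frequency threshold are assumptions; part-of-speech filtering would be a further step.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus; in practice this would be the full text of the document collection.
corpus = (
    "machine learning for information retrieval "
    "information retrieval with neural networks "
    "neural networks for passage ranking "
    "passage ranking and information retrieval"
)
tokens = corpus.lower().split()  # naive whitespace tokenisation, for the sketch only

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop bigrams seen fewer than 2 times

# The likelihood-ratio measure downweights pairs that co-occur only by chance.
candidates = finder.nbest(measures.likelihood_ratio, 10)
print([" ".join(bigram) for bigram in candidates])
# e.g. ['information retrieval', 'neural networks', 'passage ranking']
```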
How to evaluate an autocomplete system
Acceptance rate
Every new character typed is a rejection
The rank of the accepted suggestion
what position it occupied in the list of suggestions
Average keystrokes
how many characters the user had to type before they accepted a suggestion
Think of a way to detect "fake positives":
the user may have accepted a suggestion, only to realise it did not lead to the expected results, and then resumed the search / typing (see the metrics sketch below)
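A sketch of how the three metrics above could be computed from logged autocomplete sessions; the log format and field names are assumptions, and "fake positive" detection would need extra signals (e.g. whether the user searched again shortly after accepting).

```python
# Hypothetical autocomplete session logs; the field names are made up for this sketch.
# "accepted_rank" is 1-based, or None when the user typed the full query themselves.
sessions = [
    {"keystrokes_before_accept": 4, "accepted_rank": 1},
    {"keystrokes_before_accept": 7, "accepted_rank": 3},
    {"keystrokes_before_accept": 12, "accepted_rank": None},  # no suggestion accepted
]

accepted = [s for s in sessions if s["accepted_rank"] is not None]

acceptance_rate = len(accepted) / len(sessions)
mean_rank = sum(s["accepted_rank"] for s in accepted) / len(accepted)
avg_keystrokes = sum(s["keystrokes_before_accept"] for s in accepted) / len(accepted)

print(f"acceptance rate: {acceptance_rate:.2f}")     # 0.67
print(f"mean accepted rank: {mean_rank:.2f}")        # 2.00
print(f"average keystrokes: {avg_keystrokes:.2f}")   # 5.50
```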
Trends (historical order, newest trends at the bottom)
Approach 1: "Exact match"
very successful example: Okapi BM25 - a bag-of-words retrieval function - ranks documents based on the query terms appearing in each document (see the sketch after this list)
Elasticsearch used TF-IDF until 2016, when it switched to BM25
Enhancement: method to enrich query and / or document representations
Objection: lexical match only; no semantic matching (e.g. synonymy, paraphrase, term variation, and different expressions of similar intents)
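To make the "exact match" idea concrete, here is a minimal in-memory sketch of one common variant of the Okapi BM25 scoring function (toy corpus, whitespace tokenisation, typical parameter values k1 = 1.5 and b = 0.75); a production system would use an inverted index such as Elasticsearch instead.

```python
import math
from collections import Counter

# Toy corpus; each document is a list of tokens.
docs = [
    "the quick brown fox".split(),
    "quick brown foxes leap over lazy dogs".split(),
    "lazy dogs sleep all day".split(),
]

k1, b = 1.5, 0.75                              # commonly used parameter values
N = len(docs)
avgdl = sum(len(d) for d in docs) / N          # average document length
df = Counter(t for d in docs for t in set(d))  # document frequency per term

def bm25(query: list[str], doc: list[str]) -> float:
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue  # exact match only: unseen terms contribute nothing
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

query = "quick dogs".split()
ranked = sorted(docs, key=lambda d: bm25(query, d), reverse=True)
print([" ".join(d) for d in ranked])  # the second document (matches both terms) ranks first
```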
Approach 2: "Learning to rank"
manually create features for documents (e.g. BM25 scores between the query and various document fields, frequency of co-occurrence of a term pair, etc.)
train ML models (supervised learning) on these hand-crafted features (see the sketch after this list)
A real-world search engine could have hundreds of features
popular from roughly the late 1990s to the 2010s, largely thanks to the success of gradient-boosted decision trees; lost popularity with the hype around deep learning
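A sketch of the feature-based setup, assuming the LightGBM library and its LambdaRank objective (one of several learning-to-rank formulations); the handcrafted features, relevance labels and group sizes below are synthetic and purely illustrative.

```python
import numpy as np
import lightgbm as lgb

# Synthetic handcrafted features per (query, document) pair:
# [BM25 score, fraction of query terms in the title, document length]
X = np.array([
    [12.1, 0.8, 250], [7.4, 0.2, 900], [3.3, 0.0, 120],   # query 1: 3 candidates
    [15.0, 1.0, 300], [2.1, 0.1, 800],                    # query 2: 2 candidates
])
y = np.array([2, 1, 0, 2, 0])  # graded relevance labels for each pair
groups = [3, 2]                # number of candidate documents per query

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=groups)

# Rank unseen candidates for a new query by predicted score (higher = better).
candidates = np.array([[9.0, 0.5, 400], [1.0, 0.0, 700]])
order = np.argsort(-ranker.predict(candidates))
print(candidates[order])
```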
Approach 3: "Deep Learning", pre-BERT
2 main advantages:
no more exact match (instead, continuous vector representations)
no more need for laborious hand crafted features
e.g. popular: word2vec (Mikolov et al., 2013) and paragraph vectors (Le and Mikolov, 2014)
wide range of neural architectures were explored: CNN, RNN, LSTM
deep learning gained popularity due to its success in computer vision
studies found traditional neural ranking models ≈ "exact match" models in the context of limited training data (most real-world situations) [2][3]
Approach 4: "Deep Learning", BERT-based
TREC 2019 - Deep Learning Track - BERT models performed better (different teams, different tasks) [1]
[1] Pretrained Transformers for Text Ranking: BERT and Beyond, Jimmy J. Lin, Rodrigo Nogueira, Andrew Yates, NAACL, 2021. https://arxiv.org/abs/2010.06467
[2] W. Yang, K. Lu, P. Yang, and J. Lin. Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 1129–1132, Paris, France, 2019.
[3] A. Yates, S. Arora, X. Zhang, W. Yang, K. M. Jose, and J. Lin. Capreolus: A toolkit for end-to-end neural ad hoc retrieval. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 861–864, 2020.
BERT
first used in IR in 2019: to rank passages from web pages with respect to users' natural language search queries [1], on the MS MARCO passage retrieval test collection [2]
important advantage: many studies showed that with pre-trained transformer models, large amounts of relevance judgments were not necessary to build effective models for text ranking.
two common approaches for using BERT models in search (using the terminology from [3]):
multi-stage ranking architectures
learning dense representations
Why switch to a BERT-based model in the first place:
possibly less training data needed: we use BERT models pre-trained on huge amounts of data and then fine-tune them for our task; this approach was found to need considerably less task-specific training data
best results in public information retrieval competitions come from BERT-based models
Using a pre-trained model:
fine-tuning on our domain-specific data
or using it directly (research found worse results with this approach)
[1] J. Lin, R. Nogueira, A. Yates. Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2010.06467v1 [cs.IR] 13 Oct 2020 https://arxiv.org/pdf/2010.06467.pdf
[2] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang. MS MARCO: A Human Generated MAchine Reading Comprehension Dataset. https://arxiv.org/abs/1611.09268
[3] Pretrained Transformers for Text Ranking: BERT and Beyond, Jimmy J. Lin, Rodrigo Nogueira, Andrew Yates, NAACL, 2021. https://arxiv.org/abs/2010.06467
Multi-stage ranking / cross-encoder
this architecture has been used since the first applications of BERT to search ranking (2019)
receives both the query and the document as inputs: pair (q,d)
large number of parameters to learn
at prediction time, it will be applied to all candidate documents
large corpus of documents => too slow to predict for all documents => a list of candidate documents is assembled with a fast, simple algorithm (e.g. BM25) and the model predicts only for these candidates (as sketched below)
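A minimal reranking sketch (as referenced above), assuming the sentence-transformers library and a publicly available MS MARCO cross-encoder checkpoint; the query and candidate passages are made up, and in practice the candidates would come from a fast first-stage retriever such as BM25.

```python
from sentence_transformers import CrossEncoder

query = "side effects of ibuprofen"
candidates = [  # would normally come from a fast first-stage retriever
    "Ibuprofen can cause stomach pain, nausea and dizziness.",
    "Paris is the capital of France.",
    "Common ibuprofen side effects include heartburn and headache.",
]

# The cross-encoder scores each (query, passage) pair jointly.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, passage) for passage in candidates])

# Rerank candidates by score, highest first.
for score, passage in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True):
    print(f"{score:.3f}  {passage}")
```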
Learned Dense Representations / bi-encoder
this architecture had weaker results at first, but it has been improving gradually, and research has been focusing strongly on it since 2020
the problem can be formulated as:
classification (predict if query q is relevant / irrelevant for document d)
similarity (compute an embedding for q and perform kNN search in the document vector space); see the sketch below
advantages over cross-encoder:
encodings for the corpus of documents can be computed offline and stored => at prediction time, the model is only applied to the query string q => faster prediction
no need for precomputed list of candidate documents
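A minimal dense-retrieval sketch (the similarity formulation above), assuming the sentence-transformers library and one of its public MS MARCO bi-encoder checkpoints; here the kNN search is a brute-force cosine-similarity search, whereas a real system would precompute the document embeddings and store them in a vector index (e.g. FAISS).

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: documents are encoded once, offline; at query time only the query is encoded.
model = SentenceTransformer("msmarco-distilbert-base-v4")  # public MS MARCO bi-encoder

documents = [
    "Ibuprofen can cause stomach pain, nausea and dizziness.",
    "Paris is the capital of France.",
    "Common ibuprofen side effects include heartburn and headache.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)   # precompute & store

query_embedding = model.encode("side effects of ibuprofen", convert_to_tensor=True)

# k-nearest-neighbour search in the document embedding space.
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {documents[hit['corpus_id']]}")
```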
Multi-stage ranking / cross-encoder OR Learned Dense Representations / bi-encoder
The answer depends on:
the process of building the candidate list (speed / quality of results)
the inference time of the multi-stage architecture (expected to be slower than the learned dense representations architecture)
Preferred approach:
start with multi-stage (it has been researched for longer, so more documentation resources are available)
if multi-stage is suboptimal (depending on what we want to optimise), proceed to learned dense representations (more recent, possibly less documentation, could be more difficult to obtain a satisfactory model)
Inference speed:
there are methods to increase speed (if it turns out to be too low):
distillation
weights quantisation (see the sketch after this list)
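A minimal sketch of post-training dynamic quantisation with PyTorch (the quantisation option above); the checkpoint name is just an example, and distillation would instead train a smaller student model to mimic the larger one.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Dynamic quantisation stores/evaluates the linear layers in int8, which usually
# shrinks the model and speeds up CPU inference at a (often small) accuracy cost.
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("side effects of ibuprofen",
                   "Ibuprofen can cause stomach pain.",
                   return_tensors="pt")
with torch.no_grad():
    print(quantized(**inputs).logits)  # relevance score from the quantised model
```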
Text length:
BERT models have a maximum input sequence length of 512 tokens (a token is not a word; a word can be split into several tokens when text is converted to the input form expected by a BERT-based model)
workarounds:
input truncation is a popular method
or splitting a document into smaller chunks (see the sketch below)
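A sketch of the chunking workaround mentioned above, using a Hugging Face tokenizer's built-in overflow handling; the checkpoint, chunk size and overlap (stride) are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_document = "some very long document text " * 500  # far more than 512 tokens

# Split into overlapping 512-token chunks; each chunk can then be scored separately
# and the per-chunk scores aggregated (e.g. max) into a document score.
encoded = tokenizer(
    long_document,
    max_length=512,
    truncation=True,
    stride=64,                       # 64-token overlap between consecutive chunks
    return_overflowing_tokens=True,
)

print(f"number of chunks: {len(encoded['input_ids'])}")
print(f"tokens in first chunk: {len(encoded['input_ids'][0])}")
```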
Intent detection:
BERT has also been used for more accurate intent detection: LinkedIn used BERT to predict which class of document a user search is targeting (user profiles / user feeds / job posts, etc.) [1]
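LinkedIn fine-tunes BERT on its own query logs, which is not reproducible here; as a rough stand-in, the sketch below frames intent detection as text classification using an off-the-shelf zero-shot model and a made-up label set, just to illustrate the task setup.

```python
from transformers import pipeline

# Zero-shot classification as a stand-in for a fine-tuned intent classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["user profile", "user feed", "job post"]  # illustrative intent classes
for query in ["python developer berlin", "posts about remote work"]:
    result = classifier(query, candidate_labels=labels)
    print(query, "->", result["labels"][0], f"({result['scores'][0]:.2f})")
```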