Advanced RAG Techniques - Query Expansion, Cross-Encoder and Dense Passage Retrieval (DPR)

2024. 12. 29. 20:31 · @ray5273


Naive RAG

1. Indexing

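Extract data from the documents.
- Convert files in various formats, such as PDF, HTML, and Word, to plain text.
- An LLM can only process a limited amount of text at once.
- So the data is split into chunks so that it can be managed efficiently. (Chunking)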
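The chunking step can be sketched as follows. This is a minimal illustration, assuming fixed-size character windows with a small overlap; the function name `chunk_text` and the default sizes are hypothetical, and real pipelines often split on sentences or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice.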

2. Retrieval

Vectorize the user query
and compare it against the vectors stored in the Vector DB that holds the relevant data.
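A rough sketch of that comparison, assuming cosine similarity as the metric and a plain dictionary standing in for the Vector DB. The three-dimensional vectors and the names `cosine_similarity`, `retrieve`, and `toy_index` are made up for illustration; a real system would get its vectors from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, k=2):
    """index maps chunk text -> vector; return the k most similar (chunk, score) pairs."""
    scored = [(chunk, cosine_similarity(query_vec, vec)) for chunk, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-in for a Vector DB: hand-written vectors, not real embeddings.
toy_index = {
    "azure revenue grew": [0.9, 0.1, 0.0],
    "xbox hardware declined": [0.1, 0.9, 0.0],
    "office 365 seats grew": [0.7, 0.2, 0.1],
}
top = retrieve([1.0, 0.0, 0.0], toy_index, k=2)
print([chunk for chunk, _ in top])
```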

3. Generation

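This step combines the query and the retrieved data into a single prompt.
The LLM then generates the answer from the data retrieved from the vector DB together with its own knowledge.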
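One possible way to assemble such a prompt is shown below. The wording of the instruction and the helper name `build_prompt` are assumptions for illustration, not a prescribed template.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Put the retrieved chunks and the user query into one prompt string."""
    # Number the chunks so the model (and the reader) can tell them apart.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


print(build_prompt("What drove revenue growth?", ["azure grew 29 %", "office 365 grew 13 %"]))
```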

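Problems with Naive RAG

1. Limited contextual understanding.

It focuses only on keyword matching or basic semantic search.

2. The relevance and quality of the retrieved documents are inconsistent.

The quality and relevance of the documents vary too widely.
- For example, the data may be outdated or come from an unreliable source.

3. Poor integration between retrieval and generation.

The retriever and the generator are hard to keep in sync, so the pipeline is not optimized as a whole.

4. Inefficient handling of large-scale data.

There are scaling issues, and finding the relevant material can take too long.

5. Lack of robustness and adaptability.

It is hard to adapt the retrieved data to what the user actually wants.

Advanced RAG techniques were introduced to address these problems.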

Advanced RAG

Advanced RAG focuses on a few strategies.

1. Pre-retrieval

Index optimization
- Improve the user query and the indexing structure.
Data quality improvement
- Enrich the details of the data, or
- add metadata, and so on.

2. Post-retrieval

Combine the original user query with the retrieved data.
- For example, rank the results so that the most important data is emphasized.

The first of these techniques is Query Expansion.

Query Expansion

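The process is as follows.

1. Send the user query to the LLM.

2. Use the hallucinated (hypothetical) answer returned by the LLM to query the Vector DB.

3. Send the corrected query results returned by the Vector DB back to the LLM to generate the answer.

Through this process, the Vector DB supplies the actual query-related data that the LLM itself does not have.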
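The three steps above can be sketched like this. Everything here is a stand-in: `llm` and `search` are hypothetical callables for the chat model and the Vector DB, `fake_llm` and `fake_search` are toy stubs so the example runs offline, and concatenating the original query with the generated answer before searching is an assumption rather than something the steps require.

```python
from typing import Callable


def expand_and_answer(
    query: str,
    llm: Callable[[str], str],
    search: Callable[[str, int], list[str]],
    k: int = 5,
) -> str:
    # Step 1: ask the LLM for a plausible (possibly hallucinated) answer.
    hypothetical = llm(f"Write a short plausible answer to: {query}")
    # Step 2: search the Vector DB with the query plus that answer (assumed join).
    documents = search(f"{query} {hypothetical}", k)
    # Step 3: let the LLM answer again, this time with the retrieved documents.
    context = "\n".join(documents)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")


def fake_llm(prompt: str) -> str:
    """Toy LLM stub: canned guess first, then echoes the first context line."""
    if prompt.startswith("Context:"):
        return "ANSWER based on: " + prompt.splitlines()[1]
    return "Revenue grew because of cloud services."


def fake_search(text: str, k: int) -> list[str]:
    """Toy Vector DB stub: ranks two fixed documents by shared words."""
    corpus = [
        "azure and other cloud services revenue grew 29 %",
        "xbox hardware revenue decreased 11 %",
    ]
    words = set(text.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(words & set(doc.split())), reverse=True)
    return ranked[:k]


result = expand_and_answer("What drove growth?", fake_llm, fake_search, k=1)
print(result)
```

In this toy setup the bare query shares no words with either document, while the generated answer adds terms such as "cloud" and "revenue" that pull the search toward the relevant one.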

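Example result of applying Query Expansion

The figure shows how much closer the query vector moved to the stored vectors.

Red: before expansion

Orange: after expansion

Gray dots: PDF data stored in the Vector DB

Green circles: the 5 most similar results retrieved from the Vector DB

In the actual test the result is not perfect, but the orange X has moved to a region that contains more similar vectors.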

Query Expansion with Multiple Queries

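The second approach, instead of passing a single query to the Vector DB and the LLM, generates several related queries to get better results.

When a single query is not expressive enough, several similar or complementary queries are considered together in the search.

For example:

burger + good burger restaurants + handmade burger
AI + AI technology trends + machine learning

Adding several related keywords or sentences to the search in this way improves retrieval quality.

Below is a real example that stores data from Microsoft's annual report and handles a query about it.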

Original Query

What details can you provide about the factors that led to revenue growth?

Generating similar multiple queries

The original query is sent to the LLM, which generates similar queries.

Below is an example LLM prompt for generating multiple queries.

from openai import OpenAI

# Assumed setup: the client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()


def generate_multi_query(query, model="gpt-3.5-turbo"):
    # System prompt asking for up to five related, single-topic questions.
    prompt = """
    You are a knowledgeable financial research assistant.
    Your users are inquiring about an annual report.
    For the given question, propose up to five related questions to assist them in finding the information they need.
    Provide concise, single-topic questions (without compounding sentences) that cover various aspects of the topic.
    Ensure each question is complete and directly related to the original inquiry.
    List each question on a separate line without numbering.
    """
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": query},
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    # One question per line; skip blank lines so they are not sent as queries.
    return [line.strip() for line in content.split("\n") if line.strip()]

We ask the LLM to generate up to five queries related to the original query.

- What specific product or service offerings drove the revenue growth?
- Were there any pricing adjustments that influenced the revenue increase?
- How did changes in market demand impact the company's revenue growth?
- Did the company expand into new markets or regions that contributed to the revenue increase?
- Were there any strategic partnerships or collaborations that helped drive revenue growth?

The generated questions can differ from run to run.

Here are the five related questions produced by a second run of the code.

- Were there any new products or services introduced in the market?
- Did the company expand into new geographic regions during the period?
- Were there any significant marketing or advertising campaigns implemented?
- Did the company experience an increase in customer base or market share?
- Were there any changes in pricing strategy that positively impacted revenue growth?

These five queries are then sent to the Vector DB to get results.

Each query returns results like the following; only one is included here as an example.

Query: - Were there any pricing adjustments that influenced the revenue increase?

Results:
sales and marketing expenses increased $ 934 million or 4 % driven by 3 points of growth from the nuance and xandr acquisitions and investments in commercial sales, offset in part by a decline in windows advertising. sales and marketing included a favorable foreign currency impact of 2 %. general and administrative general and administrative expenses include payroll, employee benefits, stock - based compensation expense, employee severance expense incurred as part of a corporate program, and other headcount - related expenses associated with ( in millions, except percentages ) 2023 2022 percentage change research and development $ 27, 195 $ 24, 512 11 % as a percent of revenue 13 % 12 % 1ppt ( in millions, except percentages ) 2023 2022 percentage change sales and marketing $ 22, 759 $ 21, 825 4 % as a percent of revenue 11 % 11 % 0ppt ( in millions, except percentages ) 2023 2022 percentage change

• gaming revenue decreased $ 764 million or 5 % driven by declines in xbox hardware and xbox content and services. xbox hardware revenue decreased 11 % driven by lower volume and price of consoles sold. xbox content and services revenue decreased 3 % driven by a decline in first - party content, offset in part by growth in xbox game pass. • search and news advertising revenue increased $ 617 million or 5 %. search and news advertising revenue excluding traffic acquisition costs increased 11 % driven by higher search volume and the xandr acquisition. operating income decreased $ 4. 0 billion or 20 %. • gross margin decreased $ 4. 2 billion or 13 % driven by declines in windows and devices. gross margin percentage decreased driven by a decline in devices. • operating expenses decreased $ 195 million or 2 % driven by a decline in devices, offset in part by investments in search and news advertising, including 2 points of growth from the xandr acquisition.

intelligent cloud revenue increased $ 12. 9 billion or 17 %. • server products and cloud services revenue increased $ 12. 6 billion or 19 % driven by azure and other cloud services. azure and other cloud services revenue grew 29 % driven by growth in our consumption - based services. server products revenue decreased 1 %. • enterprise services revenue increased $ 315 million or 4 % driven by growth in enterprise support services, offset in part by a decline in industry solutions ( formerly microsoft consulting services ). operating income increased $ 4. 7 billion or 14 %. • gross margin increased $ 8. 9 billion or 17 % driven by growth in azure and other cloud services and the change in accounting estimate. gross margin percentage decreased slightly. excluding the impact of the change in accounting estimate, gross margin percentage decreased 3 points driven by sales mix shift to azure and other cloud services and a decline in azure and other cloud services.

marketing, and selling our other products and services ; and income taxes. highlights from fiscal year 2023 compared with fiscal year 2022 included : • microsoft cloud revenue increased 22 % to $ 111. 6 billion. • office commercial products and cloud services revenue increased 10 % driven by office 365 commercial growth of 13 %. • office consumer products and cloud services revenue increased 2 % and microsoft 365 consumer subscribers increased to 67. 0 million. • linkedin revenue increased 10 %. • dynamics products and cloud services revenue increased 16 % driven by dynamics 365 growth of 24 %. • server products and cloud services revenue increased 19 % driven by azure and other cloud services growth of 29 %. • windows original equipment manufacturer licensing ( “ windows oem ” ) revenue decreased 25 %. • devices revenue decreased 24 %. • windows commercial products and cloud services revenue increased 5 %. • xbox content and services revenue decreased 3 %.

segment results of operations reportable segments fiscal year 2023 compared with fiscal year 2022 productivity and business processes revenue increased $ 5. 9 billion or 9 %. • office commercial products and cloud services revenue increased $ 3. 7 billion or 10 %. office 365 commercial revenue grew 13 % with seat growth of 11 %, driven by small and medium business and frontline worker offerings, as well as growth in revenue per user. office commercial products revenue declined 21 % driven by continued customer shift to cloud offerings. • office consumer products and cloud services revenue increased $ 140 million or 2 %. microsoft 365 consumer subscribers grew 12 % to 67. 0 million. • linkedin revenue increased $ 1. 3 billion or 10 % driven by talent solutions. • dynamics products and cloud services revenue increased $ 750 million or 16 % driven by dynamics 365 growth of 24 %. operating income increased $ 4. 5 billion or 15 %.

The results returned by the vector DB for each of the five queries are combined and passed to the LLM, which generates the final text.
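Because several related queries tend to hit the same passages, it can help to deduplicate before building the final prompt. The sketch below assumes a hypothetical `search(query, k)` callable for the Vector DB and uses a dictionary of canned results so it runs without one.

```python
def retrieve_for_queries(queries, search, k=5):
    """Run every query and keep each document once, in first-seen order."""
    seen = set()
    unique_docs = []
    for q in queries:
        for doc in search(q, k):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


# Canned results standing in for a Vector DB (made-up data).
canned = {
    "pricing adjustments?": ["sales and marketing ...", "intelligent cloud ..."],
    "new markets?": ["intelligent cloud ...", "segment results ..."],
}


def canned_search(q, k):
    return canned[q][:k]


merged = retrieve_for_queries(list(canned), canned_search, k=2)
print("\n\n".join(merged))  # this joined text would go into the final LLM prompt
```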

Final result

A result like the one below can come out.

You can see that the answer has been corrected to some degree by the information in the document.

Final Answer:
['The factors that led to revenue growth for the company include:', '', 
'1. **Acquisitions and Investments:** Sales and marketing expenses increased, driven by growth from the Nuance and Xandr acquisitions. This contributed 3 points of growth to revenue.', '', 
'2. **Commercial Sales Focus:** Investments in commercial sales also played a role in driving revenue growth.', '', 
'3. **Foreign Currency Impact:** Sales and marketing expenses included a favorable foreign currency impact of 2%, which also contributed to revenue growth.', '', 
'4. **Segment Revenue Increases:** For the fiscal year 2023, revenue growth was observed across various segments such as Productivity and Business Processes, Intelligent Cloud, and More Personal Computing.', '', 
'5. **Specific Product and Service Offerings:** Revenue growth was driven by specific product and service offerings within segments, such as Dynamics 365 growth, Azure and other cloud services, and Microsoft 365 subscriptions.', '', 
'6. **Market Demand:** Changes in market demand, especially increased demand for Microsoft Cloud services, Office products, and Dynamics products, influenced revenue growth.', '', 
"7. **Strategic Focus:** The company's strategic focus on reaching new users in new markets, like frontline workers, small and medium businesses, and growth markets, also contributed to revenue growth.", '', 
"Overall, a combination of strategic investments, acquisitions, foreign exchange impacts, market demand shifts, and specific product offerings led to the company's revenue growth in the specified period."]

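Drawbacks of Query Expansion with multiple queries

Naturally, increasing the number of queries increases the number of results.
- The generated queries are not always related to the original question or meaningful.
The query results themselves are not always relevant or meaningful either.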

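Re-Ranking (Cross-Encoder)

1. What is Re-Ranking?

Re-Ranking takes the top candidates produced by an earlier search or filtering stage (e.g., the top 100 documents), applies an additional model or set of rules to them, and reorders them into the final ranking.
Typically, the initial retrieval stage uses traditional BM25 or a Bi-Encoder (dual-encoder) to fetch many candidates quickly, and a computationally expensive model such as a Cross-Encoder then evaluates only the top candidates in detail to produce the final result.

2. What is a Cross-Encoder?

It is a model architecture that feeds the query and the document into a single Transformer at the same time and computes the similarity directly from the [CLS] token or a dedicated head.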

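Why use BERT

1. Deep contextual understanding

- Traditional embedding models (Word2Vec, GloVe, etc.) produce one fixed vector per word or token. For example, the vector for "apple" cannot distinguish the fruit from Apple Inc.
- BERT uses the Transformer architecture so that all tokens in a sentence attend to one another (self-attention). It looks at the context around "apple" and embeds it differently depending on whether it means the company or the fruit.
- As a result, words with multiple meanings are handled much more precisely, which gives high accuracy in search, question answering, and similar tasks.

2. Fine-tuning flexibility

- BERT is fine-tuned by adding a small task-specific head (e.g., classification, question answering, NER) on top of the pre-trained model.
- It can be transferred quickly to a variety of tasks such as Re-Ranking, question answering (QA), and document classification, which is a strong advantage for development productivity.

3. A mature ecosystem and many research examples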

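How Cross-Encoder Re-Ranking works

1. Candidate set extraction (Initial Retrieval)

First stage of the search pipeline: use BM25, Sparse/Dense Retrieval, or a Bi-Encoder model to quickly extract tens to hundreds of candidates.
Example: for the query "How to cook pasta", fetch the top-100 documents that BM25 considers most relevant.

In the actual code, this is done by sending the query to the Vector DB.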

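2. Building the Cross-Encoder input

Pair each extracted candidate document with the query and format the pair as Transformer input (Query [SEP] Doc).
When the input is very long, one strategy is to include only an excerpt or a summary of the document.

In the actual code, the candidate documents from step 1 are paired with the query and run through a BERT model.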
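As a rough illustration of this step, the sketch below pairs the query with each candidate and truncates long documents by word count. The helper names and the word-based cutoff are assumptions; in practice the model's tokenizer inserts the special tokens and truncates by tokens, so the string form is only for showing the layout.

```python
def build_pairs(query, documents, max_doc_words=256):
    """Pair the query with each candidate, keeping only the first max_doc_words words."""
    pairs = []
    for doc in documents:
        excerpt = " ".join(doc.split()[:max_doc_words])
        pairs.append((query, excerpt))
    return pairs


def as_bert_text(pair):
    """Show the BERT sentence-pair layout: [CLS] query [SEP] document [SEP]."""
    query, doc = pair
    return f"[CLS] {query} [SEP] {doc} [SEP]"


pairs = build_pairs("how to cook pasta", ["boil water then add pasta and salt"], max_doc_words=4)
print(as_bert_text(pairs[0]))
```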

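3. Relevance Score computation

The Cross-Encoder encodes the query-document pair, and a classification head (fully connected layer + activation function) attached to the [CLS] output (or a specific node of the last layer) predicts the Relevance Score.
Example: map the score to the 0-1 range and read values closer to 1 as "the query and the document are highly related".

The scores produced for the pairs from step 2 are then sorted.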
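One common way to map a raw model output (a logit) to the 0-1 range is the sigmoid function. The sketch below assumes that choice; the logits are invented numbers, not output from a real model.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Made-up raw scores for three candidate documents.
logits = {"doc_a": 2.0, "doc_b": -1.0, "doc_c": 0.0}
scores = {doc: sigmoid(value) for doc, value in logits.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Since the sigmoid is monotonic, sorting by raw logits gives the same order; the conversion matters mainly when a fixed threshold on the score is used.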

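4. Candidate document re-ranking

Based on the scores computed above, all candidate documents are re-ordered (Re-Ranked) in descending order, or by another suitable rule, to determine the final ranking.

In the actual code, only the documents with high relevance scores are kept and used.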
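Putting the four steps together, a re-ranker only needs a function that scores a (query, document) pair. In the sketch below `overlap_score` is a toy stand-in for a Cross-Encoder (it just counts shared words), and `rerank` and the candidate list are hypothetical; swapping in a real model's scoring function would leave the rest unchanged.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 3):
    """Score every (query, candidate) pair and keep the top_n highest-scoring ones."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, doc: str) -> float:
    """Toy scorer, NOT a real cross-encoder: fraction of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Candidates as they might come back from the initial retrieval step.
candidates = [
    "xbox hardware revenue decreased",
    "azure cloud revenue growth was strong",
    "cloud revenue growth drivers",
]
best = rerank("azure cloud revenue growth", candidates, overlap_score, top_n=2)
print(best)
```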

Use case

Search engines
Question answering systems
Recommendation systems
Legal document search

DPR (Dense Passage Retrieval)

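DPR is a Bi-Encoder-based dense retrieval technique that maps questions and documents (passages) to deep learning embeddings and then quickly finds the relevant documents through similarity search.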
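The key difference from a Cross-Encoder is that passages are embedded once, ahead of time, and a question is compared to them with a cheap vector operation. The sketch below assumes inner-product similarity and uses hand-written two-dimensional vectors in place of real encoder outputs; `passage_index` and `dpr_search` are illustrative names.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Passage embeddings, computed offline by a passage encoder (toy values here).
passage_index = {
    "p1": [1.0, 0.0],
    "p2": [0.0, 1.0],
    "p3": [0.6, 0.6],
}


def dpr_search(question_vec, index, k=1):
    """Return the ids of the k passages with the highest inner product."""
    ranked = sorted(index.items(), key=lambda item: dot(question_vec, item[1]), reverse=True)
    return [passage_id for passage_id, _ in ranked[:k]]


# The question embedding would come from a separate question encoder at query time.
print(dpr_search([0.9, 0.2], passage_index, k=2))
```

Because only the question is encoded at query time, this scales to large collections, which is why it is typically used for the initial retrieval step before a Cross-Encoder re-ranks the top results.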

Use cases

Open-domain question answering
- Its strong grasp of contextual meaning gives good performance on open-domain QA.
Document retrieval
- Combined with a vector DB, real-time search is also possible.
Customer support

References

RAG paradigms (Naive RAG, Advanced RAG, Modular RAG), g3lu.tistory.com
