Docs as code - (2) Using Typesense as an On-premise Search Engine for Docusaurus v3

Work Improvement / Docs as code | 2024. 12. 11. 00:02 | @ray5273

Environment

Everything below was run on Windows 11.

Docusaurus and the Typesense server both run on the local machine, addressed as host.docker.internal instead of localhost.

The Docusaurus version is 3.5.2.

Steps

1. Update docusaurus.config.ts.

1) Change url and baseUrl.

Whatever values they had before, change them as follows so the site can be served locally.

// Set the production url of your site here
  url: 'http://host.docker.internal',
  // Set the /<baseUrl>/ pathname under which your site is served
  // For GitHub pages deployment, it is often '/<projectName>/'
  baseUrl: '/',

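Set url to host.docker.internal, not localhost. The scraper used later runs inside a Docker container, where localhost refers to the container itself rather than to the machine serving the site.

2) Install the Docusaurus Typesense search theme.

$ npm install docusaurus-theme-search-typesense@next --save

3) Add themes and themeConfig to docusaurus.config.ts.

To add the Typesense search box, add the theme and the typesense themeConfig as follows.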

 themes: [
    // ... Your other themes.
    'docusaurus-theme-search-typesense',
  ],
  themeConfig: {
    typesense: {
      // Replace this with the name of your index/collection.
      // It should match the "index_name" entry in the scraper's "config.json" file.
      typesenseCollectionName: 'mycollection',

      typesenseServerConfig: {
        nodes: [
          {
            host: 'host.docker.internal',
            port: 8108,
            protocol: 'http',
          },
        ],
        apiKey: 'xyz',
      },

      // Optional: Typesense search parameters: https://typesense.org/docs/0.24.0/api/search.html#search-parameters
      typesenseSearchParameters: {},

      // Optional
      contextualSearch: true,
    },
    // ... (remaining themeConfig omitted)
  }

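These values are chosen because typesense-scraper will be run through Docker.

In more detail:

1. host: use "host.docker.internal".

2. port: use 8108.

3. apiKey: xyz, the key used for local runs here. It has to match the --api-key value passed to the Typesense server in step 2.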

Then start Docusaurus as follows.

$ npm run serve -- --build --port=80 --host=host.docker.internal

This command builds the site and serves it on port 80.

2. Run the Typesense server locally.

I got the Typesense server running by following the guide linked below.

Install Typesense | Typesense (typesense.org)

Start the Typesense server with either docker or docker-compose.

I used docker-compose.

As the Typesense guide instructs, create a typesense-data folder and write docker-compose.yml as follows.

services:
  typesense:
    image: typesense/typesense:26.0
    restart: on-failure
    ports:
      - "8108:8108"
    volumes:
      - ./typesense-data:/data
    command: '--data-dir /data --api-key=xyz --enable-cors'

Then start it with the following command.

docker-compose up

The server then runs with api key=xyz on port 8108.
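Before moving on, it can help to confirm that the server really is reachable. The following is a minimal TypeScript sketch, not part of the original setup: it calls Typesense's /health endpoint, and the names TypesenseNode, healthUrl, and isTypesenseUp are hypothetical helpers introduced only for illustration. It assumes Node 18 or later for the global fetch.

```typescript
// Hypothetical helper names; host, port, and protocol mirror the
// docker-compose.yml above.
interface TypesenseNode {
  host: string;
  port: number;
  protocol: "http" | "https";
}

const localNode: TypesenseNode = {
  host: "host.docker.internal",
  port: 8108,
  protocol: "http",
};

// Build the URL of the Typesense health endpoint for a node.
function healthUrl(node: TypesenseNode): string {
  return `${node.protocol}://${node.host}:${node.port}/health`;
}

// Ask the server whether it is ready; resolves to false on any failure.
async function isTypesenseUp(node: TypesenseNode): Promise<boolean> {
  try {
    const res = await fetch(healthUrl(node));
    if (!res.ok) return false;
    const body = (await res.json()) as { ok?: boolean };
    return body.ok === true;
  } catch {
    // For example, connection refused because the container is not up yet.
    return false;
  }
}

// Usage:
// isTypesenseUp(localNode).then((up) => console.log(up ? "up" : "down"));
```

The check is left as a function rather than run at import time so that it can be called after docker-compose has had a moment to start the container.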

Finally, a search index has to be built for the documentation site generated by Docusaurus.

3. Build the search index with the Typesense scraper.

The search index for the Docusaurus pages is built against the locally served site.

I followed the guide linked below.

Search for Documentation Sites | Typesense (typesense.org)

Running the Typesense scraper requires two files that you write yourself: docusaurus.json and environment.env.

1) docusaurus.json

{
  "index_name": "mycollection",
  "start_urls": [
    "http://host.docker.internal"
  ],
  "js_wait": 2,
  "js_render": true,
  "sitemap_alternate_links": true,
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, .theme-default-content ul li, .theme-default-content table tbody tr"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  }
}

2) environment.env

TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=host.docker.internal
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
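The host, port, protocol, and API key now appear in several places (docusaurus.config.ts, docker-compose.yml, and environment.env) and have to agree. As a rough sketch of one way to keep them in sync, the TypeScript below derives both the typesenseServerConfig object and the environment.env text from a single settings object. The names TypesenseSettings, toServerConfig, and toEnvFile are hypothetical and not part of any library.

```typescript
// Hypothetical single source of truth for the connection settings.
interface TypesenseSettings {
  host: string;
  port: number;
  protocol: "http" | "https";
  apiKey: string;
}

const settings: TypesenseSettings = {
  host: "host.docker.internal",
  port: 8108,
  protocol: "http",
  apiKey: "xyz",
};

// Object in the shape used under themeConfig.typesense.typesenseServerConfig.
function toServerConfig(s: TypesenseSettings) {
  return {
    nodes: [{ host: s.host, port: s.port, protocol: s.protocol }],
    apiKey: s.apiKey,
  };
}

// Text of environment.env for the scraper, one KEY=value pair per line.
function toEnvFile(s: TypesenseSettings): string {
  return [
    `TYPESENSE_API_KEY=${s.apiKey}`,
    `TYPESENSE_HOST=${s.host}`,
    `TYPESENSE_PORT=${s.port}`,
    `TYPESENSE_PROTOCOL=${s.protocol}`,
  ].join("\n");
}

console.log(toEnvFile(settings));
```

With this approach a change of host or key is made once, which avoids the mismatch described under Troubleshooting item 3.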

With these two files in place, run the command below to build the Typesense index.

3) Run typesense-scraper

$ docker run -it --env-file=environment.env -e "CONFIG=$(cat docusaurus.json | jq -r tostring)" typesense/docsearch-scraper:0.11.0
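The -e "CONFIG=..." argument in the command above hands the whole config to the container as one single-line JSON string, which is what jq -r tostring produces from the pretty-printed file. As an illustration only, the TypeScript sketch below does the equivalent compaction with JSON.parse and JSON.stringify. The function name compactJson is hypothetical, and the sample config is shortened for brevity.

```typescript
// Parse the pretty-printed config and re-serialize it without whitespace,
// giving the same kind of single-line JSON that `jq -r tostring` prints
// for an object. JSON.parse throws if the file is not valid JSON.
function compactJson(text: string): string {
  return JSON.stringify(JSON.parse(text));
}

// Shortened sample; the real docusaurus.json has more keys.
const prettyConfig = `{
  "index_name": "mycollection",
  "start_urls": [
    "http://host.docker.internal"
  ]
}`;

console.log(compactJson(prettyConfig));
// {"index_name":"mycollection","start_urls":["http://host.docker.internal"]}
```

A side benefit of parsing first is that a malformed config fails immediately on the host instead of inside the container.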

If the scraper finishes without errors and reports nb hits greater than 0, the index was created successfully.

4. Try a Typesense search

Open the website started in step 1 and check that the Typesense search works.

Opening host.docker.internal:80 shows the Typesense search UI.

If searching for a term returns results, the Typesense server is working correctly.
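The same check can be made without the UI by calling the Typesense search API directly. The sketch below is a hedged example: searchUrl and countHits are hypothetical helper names, and querying the content field is an assumption about the schema the scraper creates (content is one of the attributes listed in docusaurus.json). It assumes Node 18 or later for the global fetch.

```typescript
// Build a search request URL for one collection.
// `queryBy` names the field to search; "content" is an assumption about
// the schema created by the scraper.
function searchUrl(
  baseUrl: string,
  collection: string,
  q: string,
  queryBy: string = "content",
): string {
  const params = new URLSearchParams({ q, query_by: queryBy });
  const name = encodeURIComponent(collection);
  return `${baseUrl}/collections/${name}/documents/search?${params.toString()}`;
}

// Return how many documents matched the query.
async function countHits(
  baseUrl: string,
  apiKey: string,
  collection: string,
  q: string,
): Promise<number> {
  const res = await fetch(searchUrl(baseUrl, collection, q), {
    headers: { "X-TYPESENSE-API-KEY": apiKey },
  });
  if (!res.ok) throw new Error(`Typesense returned HTTP ${res.status}`);
  const body = (await res.json()) as { found: number };
  return body.found;
}

// Usage:
// countHits("http://host.docker.internal:8108", "xyz", "mycollection", "docs")
//   .then((n) => console.log(`found ${n} documents`));
```

A result of 0 for a term that clearly appears in the docs suggests the index is empty, which is the situation covered under Troubleshooting item 4.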

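This is how a search server and its index can be set up with Typesense.

Troubleshooting

1. Port 80 is already in use on Windows

netstat -ano | findstr :80

The last column of the output is the PID of the process holding the port (note that findstr :80 also matches ports such as 8080). In PowerShell, that PID can be killed with the command below, where 1234 is a placeholder.

Stop-Process -Id 1234 -Force

2. typesense-scraper fails to read the JSON properly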

$ docker run -it --net="host" --env-file=.env -e "CONFIG=$(cat config.json | jq -r tostring)" typesense/docsearch-scraper:0.11.0

Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/seleuser/src/index.py", line 138, in <module>
    run_config(environ['CONFIG'])
  File "/home/seleuser/src/index.py", line 45, in run_config
    typesense_helper.create_tmp_collection()
  File "/home/seleuser/src/typesense_helper.py", line 38, in create_tmp_collection
    self.typesense_client.collections[self.collection_name_tmp].delete()
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/typesense/collection.py", line 22, in delete
    return self.api_call.delete(self._endpoint_path())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/typesense/api_call.py", line 158, in delete
    return self.make_request(requests.delete, endpoint, True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/typesense/api_call.py", line 129, in make_request
    raise last_exception
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/typesense/api_call.py", line 113, in make_request
    error_message = r.json().get('message', 'API error.')
                    ^^^^^^^^
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/requests/models.py", line 978, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

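I have not found a fix for this yet.

If it happens again, I will add the solution here.

This occurred when I tried to run the scraper on Linux. Judging from the traceback, the JSON that fails to decode is the Typesense server's response (r.json() in typesense/api_call.py, reached while deleting the temporary collection), not config.json, so the server address in the .env file may be worth checking first.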

3. Connection Refused when running typesense-scraper

If the host in environment.env is set to localhost, the scraper cannot reach the Typesense server running in Docker.

An env file like the following triggers the problem.

TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=localhost
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http

Change the host to host.docker.internal.

See the environment.env shown in step 3 of the guide above.
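As a small illustration of the fix, the TypeScript sketch below rewrites a loopback TYPESENSE_HOST line to host.docker.internal and leaves every other line alone. The function name fixEnvHost is hypothetical; reading and writing the actual file is left out to keep the example short.

```typescript
// Replace a loopback TYPESENSE_HOST with host.docker.internal, leaving
// every other line of the env file untouched.
function fixEnvHost(envText: string): string {
  return envText
    .split("\n")
    .map((line) =>
      /^TYPESENSE_HOST=(localhost|127\.0\.0\.1)\s*$/.test(line)
        ? "TYPESENSE_HOST=host.docker.internal"
        : line,
    )
    .join("\n");
}

const brokenEnv = [
  "TYPESENSE_API_KEY=xyz",
  "TYPESENSE_HOST=localhost",
  "TYPESENSE_PORT=8108",
  "TYPESENSE_PROTOCOL=http",
].join("\n");

console.log(fixEnvHost(brokenEnv));
```

Running the function on an already correct file returns it unchanged, so it is safe to apply repeatedly.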

4. nbHits is 0

The run fails as shown below and nothing gets scraped.

 docker run -it --env-file=./environment.env -e "CONFIG=$(cat ./docusaurus.json | jq -r tostring)" typesense/docsearch-scraper
INFO:scrapy.utils.log:Scrapy 2.11.2 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.36
INFO:scrapy.addons:Enabled addons:
[]
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter',
 'LOG_ENABLED': '1',
 'LOG_LEVEL': 'ERROR',
 'TELNETCONSOLE_ENABLED': False,
 'USER_AGENT': 'Typesense DocSearch Scraper (Bot; '
               'https://typesense.org/docs/guide/docsearch.html)'}
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/scrapy/dupefilters.py:100: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden '__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method.
  warn(

WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/scrapy/dupefilters.py:59: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  fingerprinter or RequestFingerprinter()

INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
ERROR:scrapy.core.engine:Error while obtaining start requests
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/scrapy/core/engine.py", line 182, in _next_request
    request = next(self.slot.start_requests)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/seleuser/src/documentation_spider.py", line 130, in start_requests
    "alternative_links": DocumentationSpider.to_other_scheme(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/seleuser/src/documentation_spider.py", line 53, in to_other_scheme
    assert match
AssertionError
2024-11-28 12:10:23 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/scrapy/core/engine.py", line 182, in _next_request
    request = next(self.slot.start_requests)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/seleuser/src/documentation_spider.py", line 130, in start_requests
    "alternative_links": DocumentationSpider.to_other_scheme(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/seleuser/src/documentation_spider.py", line 53, in to_other_scheme
    assert match
AssertionError
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'elapsed_time_seconds': 0.001925,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 11, 28, 12, 10, 23, 152662, tzinfo=datetime.timezone.utc),
 'log_count/ERROR': 1,
 'memusage/max': 75788288,
 'memusage/startup': 75788288,
 'start_time': datetime.datetime(2024, 11, 28, 12, 10, 23, 150737, tzinfo=datetime.timezone.utc)}
INFO:scrapy.core.engine:Spider closed (finished)
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:35403/session/c64d94edb3bed9b8409943a1cab6f773 {}
DEBUG:urllib3.connectionpool:http://localhost:35403 "DELETE /session/c64d94edb3bed9b8409943a1cab6f773 HTTP/11" 200 0 
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})    
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request

Crawling issue: nbHits 0 for ComprehensiveCollection

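The cause was a malformed URL in the scraper's JSON config file.

URLs wrapped in angle brackets, as in the config below, trigger the problem; this is consistent with the AssertionError raised in to_other_scheme in the log above.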

{
  "index_name": "ComprehensiveCollection",
  "start_urls": [
    {
      "url": "<http://host.docker.internal/>"
    }
  ],
  "sitemap_alternate_links": true,
  "sitemap_urls": [
    "<http://host.docker.internal/sitemap.xml>"
  ],
  "stop_urls": [
    "/tests"
  ],
  "js_render": true,
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "h4",
    "lvl4": "h5",
    "lvl5": "h6",
    "text": "p, li"
  },
  "strip_chars": " .,;:#"
}

Likewise, using localhost instead of host.docker.internal, as in the config below, leaves nbHits at 0 and no index is created.

{
  "index_name": "mycollection",
  "start_urls": [
    "http://localhost"
  ],
  "js_wait": 2,
  "js_render": true,
  "sitemap_alternate_links": true,
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, .theme-default-content ul li, .theme-default-content table tbody tr"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  }
}

Setting up the JSON file as follows resolves the problem.

{
  "index_name": "mycollection",
  "start_urls": [
    "http://host.docker.internal"
  ],
  "js_wait": 2,
  "js_render": true,
  "sitemap_alternate_links": true,
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, .theme-default-content ul li, .theme-default-content table tbody tr"
    }
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  }
}

On success, the scraper reports nb hits greater than 0.
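Both failure cases in this item (angle brackets around the URL, and localhost as the host) can be caught before starting the scraper. The TypeScript below is a minimal sketch of such a pre-flight check; ScraperConfig, parseUrl, and findStartUrlProblems are hypothetical names, and the check only covers the two cases described here.

```typescript
// Only the part of the scraper config this check looks at.
interface ScraperConfig {
  start_urls: Array<string | { url: string }>;
}

// Parse a URL, returning null instead of throwing when it is malformed.
function parseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null;
  }
}

// Return one message per start URL that is likely to leave the index
// empty; an empty array means nothing suspicious was found.
function findStartUrlProblems(config: ScraperConfig): string[] {
  const problems: string[] = [];
  for (const entry of config.start_urls) {
    const raw = typeof entry === "string" ? entry : entry.url;
    const parsed = parseUrl(raw);
    if (parsed === null) {
      problems.push(`${raw}: not a valid URL (stray characters such as < or >?)`);
      continue;
    }
    if (parsed.hostname === "localhost" || parsed.hostname === "127.0.0.1") {
      problems.push(`${raw}: resolves to the scraper container itself inside Docker`);
    }
  }
  return problems;
}

console.log(findStartUrlProblems({ start_urls: ["http://host.docker.internal"] }));
```

For the working config above the function returns an empty array, while either of the two broken configs yields one message.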
