# frozen_string_literal: true

module DiscourseAi
  module Completions
    module Endpoints
      class Vllm < Base
        def self.can_contact?(model_provider)
          model_provider == "vllm"
        end

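        # Map Discourse's generic param names onto what vLLM's
        # OpenAI-compatible API expects: stop_sequences becomes stop, while
        # max_tokens and temperature already match.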
        def normalize_model_params(model_params)
          model_params = model_params.dup

          # max_tokens, temperature are already supported
          if model_params[:stop_sequences]
            model_params[:stop] = model_params.delete(:stop_sequences)
          end

          model_params
        end

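        # Defaults merged into every request; callers can override
        # max_tokens via model_params.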
        def default_options
          { max_tokens: 2000, model: llm_model.name }
        end

        def provider_id
          AiApiAuditLog::Provider::Vllm
        end

        private

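        # Endpoints configured as srv://<name> are resolved via a DNS SRV
        # lookup, and the OpenAI-compatible chat completions path is appended
        # to the resolved target; plain URLs are used verbatim.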
        def model_uri
          if llm_model.url.to_s.starts_with?("srv://")
            service = DiscourseAi::Utils::DnsSrv.lookup(llm_model.url.sub("srv://", ""))
            api_endpoint = "https://#{service.target}:#{service.port}/v1/chat/completions"
          else
            api_endpoint = llm_model.url
          end

          @uri ||= URI(api_endpoint)
        end

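        # When streaming, we also request usage stats: with
        # stream_options.include_usage the server reports prompt/completion
        # token counts alongside the streamed deltas.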
        def prepare_payload(prompt, model_params, dialect)
          payload = default_options.merge(model_params).merge(messages: prompt)
          if @streaming_mode
            payload[:stream] = true
            payload[:stream_options] = { include_usage: true }
          end

          payload
        end

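        # Authentication is via an X-API-KEY header; a per-model key takes
        # precedence over the ai_vllm_api_key site setting.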
        def prepare_request(payload)
          headers = { "Referer" => Discourse.base_url, "Content-Type" => "application/json" }

          api_key = llm_model&.api_key || SiteSetting.ai_vllm_api_key
          headers["X-API-KEY"] = api_key if api_key.present?

          Net::HTTP::Post.new(model_uri, headers).tap { |r| r.body = payload }
        end

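        # Tool calls for this endpoint are expressed as XML embedded in the
        # prompt and parsed back out of the completion, rather than using a
        # native tool-calling API.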
        def xml_tools_enabled?
          true
        end

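        # Called once the request finishes, so the audit log records the
        # token usage captured while decoding.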
        def final_log_update(log)
          log.request_tokens = @prompt_tokens if @prompt_tokens
          log.response_tokens = @completion_tokens if @completion_tokens
        end

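        # Non-streaming path: parse the whole response body, capture usage,
        # and return the single completion text.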
        def decode(response_raw)
          json = JSON.parse(response_raw, symbolize_names: true)
          @prompt_tokens = json.dig(:usage, :prompt_tokens)
          @completion_tokens = json.dig(:usage, :completion_tokens)
          [json.dig(:choices, 0, :message, :content)]
        end

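        # Streaming path: JsonStreamDecoder buffers partial SSE data and
        # yields each complete JSON payload as it arrives.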
        def decode_chunk(chunk)
          @json_decoder ||= JsonStreamDecoder.new
          (@json_decoder << chunk)
            .map do |parsed|
              # vLLM keeps sending usage over and over again
              prompt_tokens = parsed.dig(:usage, :prompt_tokens)
              completion_tokens = parsed.dig(:usage, :completion_tokens)

              @prompt_tokens = prompt_tokens if prompt_tokens
              @completion_tokens = completion_tokens if completion_tokens

              text = parsed.dig(:choices, 0, :delta, :content)
              if text.to_s.empty?
                nil
              else
                text
              end
            end
            .compact
        end
      end
    end
  end
end