discourse-ai/lib/completions/endpoints/vllm.rb

# frozen_string_literal: true

module DiscourseAi
  module Completions
    module Endpoints
      class Vllm < Base
        class << self
          def can_contact?(endpoint_name)
            endpoint_name == "vllm"
          end

          def dependant_setting_names
            %w[ai_vllm_endpoint_srv ai_vllm_endpoint]
          end

          def correctly_configured?(_model_name)
            SiteSetting.ai_vllm_endpoint_srv.present? || SiteSetting.ai_vllm_endpoint.present?
          end

          def endpoint_name(model_name)
            "vLLM - #{model_name}"
          end
        end

        def normalize_model_params(model_params)
          model_params = model_params.dup

          # max_tokens, temperature are already supported
          if model_params[:stop_sequences]
            model_params[:stop] = model_params.delete(:stop_sequences)
          end

          model_params
        end

        def default_options
          { max_tokens: 2000, model: model }
        end

        def provider_id
          AiApiAuditLog::Provider::Vllm
        end

        private

        def model_uri
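          # Use the model's own URL unless it is the placeholder that means
          # "resolve through the SRV record". Note that `!a == b` parses as
          # `(!a) == b`, so the comparison has to be written with `!=`.
          if llm_model&.url && llm_model.url != LlmModel::RESERVED_VLLM_SRV_URL
            return URI(llm_model.url)
          end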

          service = DiscourseAi::Utils::DnsSrv.lookup(SiteSetting.ai_vllm_endpoint_srv)
          if service.present?
            api_endpoint = "https://#{service.target}:#{service.port}/v1/chat/completions"
          else
            api_endpoint = "#{SiteSetting.ai_vllm_endpoint}/v1/chat/completions"
          end
          @uri ||= URI(api_endpoint)
        end

        def prepare_payload(prompt, model_params, _dialect)
          default_options
            .merge(model_params)
            .merge(messages: prompt)
            .tap { |payload| payload[:stream] = true if @streaming_mode }
        end

        def prepare_request(payload)
          headers = { "Referer" => Discourse.base_url, "Content-Type" => "application/json" }

          api_key = llm_model&.api_key || SiteSetting.ai_vllm_api_key
          headers["X-API-KEY"] = api_key if api_key.present?

          Net::HTTP::Post.new(model_uri, headers).tap { |r| r.body = payload }
        end

        def partials_from(decoded_chunk)
          decoded_chunk
            .split("\n")
            .map do |line|
              data = line.split("data: ", 2)[1]
              data == "[DONE]" ? nil : data
            end
            .compact
        end

        def extract_completion_from(response_raw)
          parsed = JSON.parse(response_raw, symbolize_names: true).dig(:choices, 0)
          # The payload carried no choice (for example, a partial chunk), so
          # there is nothing to extract yet.
          return if !parsed

          response_h = @streaming_mode ? parsed.dig(:delta) : parsed.dig(:message)

          response_h&.dig(:content)
        end
      end
    end
  end
end
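
The streaming path above can be exercised outside Discourse with a minimal sketch. It assumes the server emits OpenAI-style server-sent events (`data: {...}` lines terminated by `data: [DONE]`). `extract_content` is a hypothetical standalone stand-in for `extract_completion_from`, and the sample chunk is made up for illustration.

```ruby
require "json"

# Same splitting logic as Vllm#partials_from: keep the text after "data: "
# on each line and drop blank lines and the "[DONE]" sentinel.
def partials_from(decoded_chunk)
  decoded_chunk
    .split("\n")
    .map do |line|
      data = line.split("data: ", 2)[1]
      data == "[DONE]" ? nil : data
    end
    .compact
end

# Hypothetical helper: takes the streaming flag as an argument instead of
# reading @streaming_mode from the endpoint instance.
def extract_content(raw, streaming:)
  choice = JSON.parse(raw, symbolize_names: true).dig(:choices, 0)
  return if !choice

  message = streaming ? choice[:delta] : choice[:message]
  message&.dig(:content)
end

# Made-up sample chunk in the assumed SSE format.
chunk = <<~SSE
  data: {"choices":[{"delta":{"content":"Hel"}}]}

  data: {"choices":[{"delta":{"content":"lo"}}]}

  data: [DONE]
SSE

text = partials_from(chunk).map { |partial| extract_content(partial, streaming: true) }.join
puts text
```

Each partial is parsed on its own, so a caller can append the extracted fragments as they arrive rather than waiting for the full response.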