Link checker plugin and some fixes to URLs

Signed-off-by: Miki <mehranb@amazon.com>
Miki 2021-08-10 11:54:40 -07:00
parent 187bccec6b
commit 634db90e9b
17 changed files with 252 additions and 18 deletions

View File

@@ -4,6 +4,8 @@ title: Logstash
nav_order: 200
has_children: true
has_toc: true
+redirect_from:
+  - /logstash/
---
# Logstash

View File

@@ -9,7 +9,7 @@ nav_order: 220
You can ship Logstash events to an OpenSearch cluster and then visualize your events with OpenSearch Dashboards.
-Make sure you have [Logstash]({{site.url}}{{site.baseurl}}/logstash/index/#install-logstash-on-mac--linux), [OpenSearch]({{site.url}}{{site.baseurl}}/opensearch/install/index/), and [OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/dashboards/install/index/).
+Make sure you have [Logstash]({{site.url}}{{site.baseurl}}/clients/logstash/index/#install-logstash), [OpenSearch]({{site.url}}{{site.baseurl}}/opensearch/install/index/), and [OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/dashboards/install/index/).
{: .note }
## OpenSearch output plugin

View File

@@ -94,7 +94,7 @@ For a list of available unit types, see [Supported units]({{site.url}}{{site.bas
ISM supports the following operations:
-- [force_merge](#forcemerge)
+- [force_merge](#force_merge)
- [read_only](#read_only)
- [read_write](#read_write)
- [replica_count](#replica_count)

View File

@@ -874,7 +874,7 @@ GET _plugins/_alerting/monitors/alerts
Introduced 1.0
{: .label .label-purple }
-[After getting your alerts](#get-alerts/), you can acknowledge any number of active alerts in one call. If the alert is already in an ERROR, COMPLETED, or ACKNOWLEDGED state, it appears in the `failed` array.
+[After getting your alerts](#get-alerts), you can acknowledge any number of active alerts in one call. If the alert is already in an ERROR, COMPLETED, or ACKNOWLEDGED state, it appears in the `failed` array.
#### Request
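The acknowledge call takes the alert IDs in the request body (a sketch; the monitor ID and alert ID are placeholders):

```json
POST _plugins/_alerting/monitors/<monitor_id>/_acknowledge/alerts
{
  "alerts": ["eQURa3gBKo1jAh6qUo49"]
}
```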

View File

@@ -34,7 +34,7 @@ Destination | A reusable location for an action, such as Amazon Chime, Slack, or
1. Specify a name for the destination so that you can identify it later.
1. For **Type**, choose Slack, Amazon Chime, custom webhook, or [email](#email-as-a-destination).
-For Email type, refer to the [Email as a destination](#email-as-a-destination) section below. For all other types, specify the webhook URL. For more information about webhooks, see the documentation for [Slack](https://api.slack.com/incoming-webhooks) and [Chime](https://docs.aws.amazon.com/chime/latest/ug/webhooks.html).
+For Email type, refer to the [Email as a destination](#email-as-a-destination) section below. For all other types, specify the webhook URL. For more information about webhooks, see the documentation for [Slack](https://api.slack.com/incoming-webhooks) and [Amazon Chime](https://docs.aws.amazon.com/chime/latest/ug/webhooks.html).
For custom webhooks, you must specify more information: parameters and headers. For example, if your endpoint requires basic authentication, you might need to add a header with a key of `Authorization` and a value of `Basic <Base64-encoded-credential-string>`. You might also need to change `Content-Type` to whatever your webhook requires. Popular values are `application/json`, `application/xml`, and `text/plain`.
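For instance, a custom webhook that requires basic authentication and JSON payloads might use headers like these (a sketch; the credential string is a Base64 placeholder for `user:password`):

```json
{
  "Authorization": "Basic dXNlcjpwYXNzd29yZA==",
  "Content-Type": "application/json"
}
```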
@@ -296,7 +296,7 @@ Variable | Data Type | Description
`ctx.trigger.actions.destination_id`| String | The alert destination's ID.
`ctx.trigger.actions.message_template.source` | String | The message to send in the alert.
`ctx.trigger.actions.message_template.lang` | String | The scripting language used to define the message. Must be Mustache.
-`ctx.trigger.actions.throttle_enabled` | Boolean | Whether throttling is enabled for this trigger. See [adding actions](#add-actions/) for more information about throttling.
+`ctx.trigger.actions.throttle_enabled` | Boolean | Whether throttling is enabled for this trigger. See [adding actions](#add-actions) for more information about throttling.
`ctx.trigger.actions.subject_template.source` | String | The message's subject in the alert.
`ctx.trigger.actions.subject_template.lang` | String | The scripting language used to define the subject. Must be Mustache.

View File

@@ -660,7 +660,7 @@ GET opensearch_dashboards_sample_data_logs/_search
```
The `ip_range` aggregation is for IP addresses.
-It works on `ip` type fields. You can define the IP ranges and masks in [CIDR](http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing) notation.
+It works on `ip` type fields. You can define the IP ranges and masks in [CIDR](https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing) notation.
```json
GET opensearch_dashboards_sample_data_logs/_search
@@ -1026,7 +1026,7 @@ GET opensearch_dashboards_sample_data_logs/_search
The `geohash_grid` aggregation buckets documents for geographical analysis. It organizes a geographical region into a grid of smaller regions of different sizes or precisions. Lower values of precision represent larger geographical areas and higher values represent smaller, more precise geographical areas.
-The number of results returned by a query might be far too many to display each geo point individually on a map. The `geohash_grid` aggregation buckets nearby geo points together by calculating the Geohash for each point, at the level of precision that you define (between 1 and 12; the default is 5). To learn more about Geohash, see [Wikipedia](http://en.wikipedia.org/wiki/Geohash).
+The number of results returned by a query might be far too many to display each geo point individually on a map. The `geohash_grid` aggregation buckets nearby geo points together by calculating the Geohash for each point, at the level of precision that you define (between 1 and 12; the default is 5). To learn more about Geohash, see [Wikipedia](https://en.wikipedia.org/wiki/Geohash).
The web logs example data is spread over a large geographical area, so you can use a lower precision value. You can zoom in on this map by increasing the precision value:
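For example, a low-precision request over the sample web logs might look like the following (a sketch that assumes the sample data's `geo.coordinates` field):

```json
GET opensearch_dashboards_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "geo_hash": {
      "geohash_grid": {
        "field": "geo.coordinates",
        "precision": 4
      }
    }
  }
}
```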

View File

@@ -289,4 +289,4 @@ You can use wildcards to delete more than one data stream.
We recommend deleting data from a data stream using an ISM policy.
-You can also use [asynchronous search]({{site.url}}{{site.baseurl}}/async/index/), [SQL]({{site.url}}{{site.baseurl}}/sql/index/), and [PPL]({{site.url}}{{site.baseurl}}/ppl/index/) to query your data stream directly. You can also use the security plugin to define granular permissions on the data stream name.
+You can also use [asynchronous search]({{site.url}}{{site.baseurl}}/search-plugins/async/index/), [SQL]({{site.url}}{{site.baseurl}}/search-plugins/sql/index/), and [PPL]({{site.url}}{{site.baseurl}}/search-plugins/ppl/index/) to query your data stream directly. You can also use the security plugin to define granular permissions on the data stream name.

View File

@@ -12,7 +12,7 @@ For example, if you use OpenSearch as a backend search engine for your applicati
When you're writing code to convert user input into OpenSearch queries, you can simplify your code with search templates. If you need to add fields to your search query, you can just modify the template without making changes to your code.
-Search templates use the Mustache language. For a list of all syntax options, see the [Mustache manual](http://mustache.github.io/mustache.5.html).
+Search templates use the Mustache language. For a list of all syntax options, see the [Mustache manual](https://mustache.github.io/mustache.5.html).
{: .note }
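For instance, a template can expose a single `{{query_value}}` variable that your code fills in at search time (a sketch; the field and parameter names are illustrative):

```json
GET _search/template
{
  "source": {
    "query": {
      "match": {
        "title": "{{query_value}}"
      }
    }
  },
  "params": {
    "query_value": "dashboards"
  }
}
```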
## Create search templates

_plugins/link-checker.rb Normal file
View File

@@ -0,0 +1,232 @@
# frozen_string_literal: true
require "jekyll/hooks"
require "jekyll/document"
require "json"
require "set"
require "net/http"
require "uri"
require "pathname"
##
# This singleton checks links during build to warn or fail upon finding dead links.
#
# `JEKYLL_CHECK_EXTERNAL_LINKS`, set on the environment, will cause verification of external links, irrespective of its
# value. Usage: `JEKYLL_CHECK_EXTERNAL_LINKS= bundle exec jekyll build --trace`
#
# `JEKYLL_FATAL_LINK_CHECKER`, set on the environment, will cause the build to fail if an internal dead link is found.
# If set as `JEKYLL_FATAL_LINK_CHECKER=2`, the build will fail for internal and external dead links; in this case, there
# is no need to set `JEKYLL_CHECK_EXTERNAL_LINKS`.
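#
# For example, to fail the build on any dead link, internal or external:
# `JEKYLL_FATAL_LINK_CHECKER=2 bundle exec jekyll build --trace`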
module Jekyll::LinkChecker
##
# The collection in which the gathered links are stored
@urls = {}
##
# Pattern to identify documents that should be excluded based on their URL
@excluded_paths = /(\.(css|js|json|map|xml|txt|yml)$)/i.freeze
##
# Pattern to extract the href target from anchor tags
@href_matcher = /<a[^>]+href=(['"])(.+?)\1/im.freeze
##
# Pattern to check for external URLs
@external_matcher = /^https?:\/\//.freeze
##
# List of domains to ignore
@ignored_domains = %w[localhost]
##
# Pattern of local paths to ignore
@ignored_paths = /(^\/javadocs\/)/.freeze
##
# Valid response codes for successful links
@success_codes = %w[200 302]
##
# Response codes that are treated as successful, but logged with a warning
@questionable_codes = %w[301 403 429]
##
# Holds the list of failures
@failures = []
##
# Driven by environment variables, it indicates a need to check external links
@check_external_links
##
# Driven by environment variables, it indicates the need to fail the build for dead links
@should_build_fatally
##
# Initializes the singleton by recording the site
def self.init(site)
@site = site
@urls = {}
@failures = []
end
##
# Processes a Document or Page and adds the links to a collection
# It also checks for anchors to parts of the same page/doc
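# For example, `<a href="#usage">` must match an `id="usage"` attribute within the
# same page, while `<a href="/docs/">` is collected into @urls for later checking.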
def self.process(page)
return if @excluded_paths.match(page.path)
hrefs = page.content.scan(@href_matcher)
hrefs.each do |(_, href)|
relative_path = page.path[0] == '/' ? Pathname.new(page.path).relative_path_from(Dir.getwd) : page.path
if href.start_with? '#'
@failures << "##{href[1..]}, linked in ./#{relative_path}" if (page.content =~ /<[a-z0-9-]+[^>]+id="#{href[1..]}"/i).nil?
else
@urls[href] = Set[] unless @urls.key?(href)
@urls[href] << relative_path
end
end
end
##
# Verifies all the gathered links once the site has been written
def self.verify(site)
if ENV.key?('JEKYLL_CHECK_EXTERNAL_LINKS')
@check_external_links = true
puts "LinkChecker: [Notice] Will verify external links"
end
if ENV.key?('JEKYLL_FATAL_LINK_CHECKER')
@should_build_fatally = true
if ENV['JEKYLL_FATAL_LINK_CHECKER'] == '2'
@check_external_links = true
puts "LinkChecker: [Notice] The build will fail if any dead links are found"
else
puts "LinkChecker: [Notice] The build will fail if a dead internal link is found"
end
end
@base_url_matcher = /^#{@site.config["url"]}#{@site.baseurl}(\/.*)$/.freeze
@urls.each do |url, pages|
@failures << "#{url}, linked to in ./#{pages.to_a.join(", ./")}" unless self.check(url)
end
unless @failures.empty?
msg = "Found #{@failures.size} dead link#{@failures.size > 1 ? 's' : ''}:\n#{@failures.join("\n")}"
raise msg if @should_build_fatally
puts "\nLinkChecker: [Warning] #{msg}\n"
end
end
##
# Check if URL is accessible
def self.check(url)
match = @base_url_matcher.match(url)
unless match.nil?
url = match[1]
end
if @external_matcher =~ url
return true unless @check_external_links
return self.check_external(url)
end
return self.check_internal(url)
end
##
# Check if an external URL is accessible by making a GET request
def self.check_external(url)
uri = URI(url)
return true if @ignored_domains.include? uri.host
http = Net::HTTP.new(uri.host, uri.port)
# use_ssl must be set before the session is started
http.use_ssl = (uri.scheme == "https")
http.start do
request = Net::HTTP::Get.new(uri)
http.request(request) do |response|
return true if @success_codes.include? response.code
puts "LinkChecker: [Warning] Got #{response.code} from #{url}"
return @questionable_codes.include? response.code
end
end
end
##
# Check if an internal link is accessible
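# For example, "/opensearch/install/" is looked up as
# "<destination>/opensearch/install/index.html" in the built site, and any
# "#hash" suffix is then matched against the ids in that file.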
def self.check_internal(url)
return true if @ignored_paths =~ url
path, hash = url.split('#')
unless path.end_with? 'index.html'
path << '/' unless path.end_with? '/'
path << 'index.html'
end
filename = File.join(@site.config["destination"], path)
return false unless File.file?(filename)
content = File.read(filename)
unless content.include? "<title>Redirecting"
return true if hash.nil? || hash.empty?
return !(content =~ /<[a-z0-9-]+[^>]+id="#{hash}"/i).nil?
end
match = content.match(@href_matcher)
if match.nil?
puts "LinkChecker: [Warning] Cannot check #{url} due to an unfollowable redirect"
return true
end
redirect = match[2]
redirect << '#' + hash unless hash.nil? || hash.empty?
return self.check(redirect)
end
end
# Before any Document or Page is processed, initialize the LinkChecker
Jekyll::Hooks.register :site, :pre_render do |site|
Jekyll::LinkChecker.init(site)
end
# Process a Page as soon as its content is ready
Jekyll::Hooks.register :pages, :post_convert do |page|
Jekyll::LinkChecker.process(page)
end
# Process a Document as soon as its content is ready
Jekyll::Hooks.register :documents, :post_convert do |document|
Jekyll::LinkChecker.process(document)
end
# Verify gathered links after Jekyll has finished writing all the output files
Jekyll::Hooks.register :site, :post_write do |site|
Jekyll::LinkChecker.verify(site)
end

View File

@@ -12,7 +12,7 @@ redirect_from: /knn/approximate-knn/
The approximate k-NN method uses [nmslib's](https://github.com/nmslib/nmslib/) implementation of the Hierarchical Navigable Small World (HNSW) algorithm to power k-NN search. In this case, approximate means that for a given search, the neighbors returned are an estimate of the true k-nearest neighbors. Of the three methods, this method offers the best search scalability for large data sets. Generally speaking, once the data set gets into the hundreds of thousands of vectors, this approach is preferred.
-The k-NN plugin builds an HNSW graph of the vectors for each "knn-vector field"/ "Lucene segment" pair during indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about Lucene segments, please refer to [Apache Lucene's documentation](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description). These graphs are loaded into native memory during search and managed by a cache. To learn more about pre-loading graphs into memory, refer to the [warmup API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#warmup). Additionally, you can see what graphs are already loaded in memory, which you can learn more about in the [stats API section]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#stats).
+The k-NN plugin builds an HNSW graph of the vectors for each "knn-vector field"/ "Lucene segment" pair during indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about Lucene segments, please refer to [Apache Lucene's documentation](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description). These graphs are loaded into native memory during search and managed by a cache. To learn more about pre-loading graphs into memory, refer to the [warmup API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#warmup-operation). Additionally, you can see what graphs are already loaded in memory, which you can learn more about in the [stats API section]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#stats).
Because the graphs are constructed during indexing, it is not possible to apply a filter on an index and then use this search method. All filters are applied on the results produced by the approximate nearest neighbor search.
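For illustration, the warmup call mentioned above takes a comma-separated list of indices (a sketch; the index name is a placeholder):

```json
GET /_plugins/_knn/warmup/my-knn-index
```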

View File

@@ -14,7 +14,7 @@ With the k-NN plugin's Painless Scripting extensions, you can use k-NN distance
## Get started with k-NN's Painless Scripting functions
-To use k-NN's Painless Scripting functions, first create an index with `knn_vector` fields like in [k-NN score script]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-score-script#getting-started-with-the-score-script). Once the index is created and you ingest some data, you can use the Painless extensions:
+To use k-NN's Painless Scripting functions, first create an index with `knn_vector` fields like in [k-NN score script]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-score-script#getting-started-with-the-score-script-for-vectors). Once the index is created and you ingest some data, you can use the Painless extensions:
```json
GET my-knn-index-2/_search

View File

@@ -12,7 +12,7 @@ The security plugin includes an internal user database. Use this database in pla
Roles are the core way of controlling access to your cluster. Roles contain any combination of cluster-wide permissions, index-specific permissions, document- and field-level security, and tenants. Then you map users to these roles so that users gain those permissions.
-Unless you need to create new [reserved or hidden users]({{site.url}}{{site.baseurl}}/security-plugin/access-control/api#read-only-and-hidden-resources), we **highly** recommend using OpenSearch Dashboards or the REST API to create new users, roles, and role mappings. The `.yml` files are for initial setup, not ongoing use.
+Unless you need to create new [reserved or hidden users]({{site.url}}{{site.baseurl}}/security-plugin/access-control/api#reserved-and-hidden-resources), we **highly** recommend using OpenSearch Dashboards or the REST API to create new users, roles, and role mappings. The `.yml` files are for initial setup, not ongoing use.
{: .warning }
---

View File

@@ -175,7 +175,7 @@ Use a date pattern in the index name to configure daily, weekly, or monthly roll
plugins.security.audit.config.index: "'auditlog-'YYYY.MM.dd"
```
-For a reference on the date pattern format, see the [Joda DateTimeFormat documentation](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html).
+For a reference on the date pattern format, see the [Joda DateTimeFormat documentation](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html).
## (Advanced) Tune the thread pool

View File

@@ -237,7 +237,7 @@ In this case, the header states that the message was signed using HMAC-SHA256.
### Payload
-The payload of a JSON web token contains the so-called [JWT Claims](http://self-issued.info/docs/draft-ietf-oauth-json-web-token.html#RegisteredClaimName). A claim can be any piece of information about the user that the application that created the token has verified.
+The payload of a JSON web token contains the so-called [JWT Claims](https://self-issued.info/docs/draft-ietf-oauth-json-web-token.html#RegisteredClaimName). A claim can be any piece of information about the user that the application that created the token has verified.
The specification defines a set of standard claims with reserved names ("registered claims"). These include, for example, the token issuer, the expiration date, or the creation date.
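A minimal payload with registered claims might look like the following (values are illustrative):

```json
{
  "iss": "https://idp.example.com",
  "sub": "jdoe",
  "iat": 1628607280,
  "exp": 1628610880
}
```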

View File

@@ -113,7 +113,7 @@ When an IdP generates and signs a JSON web token, it must add the ID of the key
}
```
-As per the [OpenID Connect specification](http://openid.net/specs/openid-connect-messages-1_0-20.html), the `kid` (key ID) is mandatory. Token verification does not work if an IdP fails to add the `kid` field to the JWT.
+As per the [OpenID Connect specification](https://openid.net/specs/openid-connect-messages-1_0-20.html), the `kid` (key ID) is mandatory. Token verification does not work if an IdP fails to add the `kid` field to the JWT.
If the security plugin receives a JWT with an unknown `kid`, it visits the IdP's `jwks_uri` and retrieves all available, valid keys. These keys are used and cached until a refresh is triggered by retrieving another unknown key ID.

View File

@@ -29,7 +29,7 @@ The operating system for each OpenSearch node handles encryption of data at rest
cryptsetup luksFormat --key-file <key> <partition>
```
-For full documentation on the command, see [the Linux man page](http://man7.org/linux/man-pages/man8/cryptsetup.8.html).
+For full documentation on the command, see [the Linux man page](https://man7.org/linux/man-pages/man8/cryptsetup.8.html).
{% comment %}
## Beats

View File

@@ -21,7 +21,7 @@ This page includes troubleshooting steps for configuring TLS certificates with t
## Validate YAML
-`opensearch.yml` and the files in `opensearch_security/securityconfig/` are in the YAML format. A linter like [YAML Lint](http://www.yamllint.com/) can help verify that you don't have any formatting errors.
+`opensearch.yml` and the files in `opensearch_security/securityconfig/` are in the YAML format. A linter like [YAML Validator](https://codebeautify.org/yaml-validator) can help verify that you don't have any formatting errors.
## View contents of PEM certificates
@@ -207,7 +207,7 @@ plugins.security.ssl.http.enabled_protocols:
TLS relies on the server and client negotiating a common cipher suite. Depending on your system, the available ciphers will vary. They depend on the JDK or OpenSSL version you're using, and whether or not the `JCE Unlimited Strength Jurisdiction Policy Files` are installed.
-For legal reasons, the JDK does not include strong ciphers like AES256. In order to use strong ciphers you need to download and install the [Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files](http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html). If you don't have them installed, you might see an error message on startup:
+For legal reasons, the JDK does not include strong ciphers like AES256. In order to use strong ciphers you need to download and install the [Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files](https://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html). If you don't have them installed, you might see an error message on startup:
```
[INFO ] AES-256 not supported, max key length for AES is 128 bit.