This is an importer I wrote to restore some users that were
accidentally deleted for being purged as old staged users or old
unactivated users.
It reads from CSV files exported from a discourse sql backup.
When running an import script there are many site settings that are
changed but we reset them back to where they were originally before the
import. However, there are two settings that we don't roll back:
```
purge_unactivated_users_grace_period_days
purge_deleted_uploads_grace_period_days
```
which could have some unintended consequences.
My first question is do we *really* have to change these settings? I'm
not a huge fan of changing someones settings without them really knowing
they were changed.
If we really do have to change these settings here is my proposed PR
where we don't alter the `purge_unactivated_users_grace_period_days` if
it has been disabled.
As I'm writing this another change we could make is that we don't change
either of these site settings if we detect that they aren't set to the
default values.
The drive behind this PR is that there is a discourse instance which
relies on staged users as part of their workflow and this setting was
changed by accident via the import script causing users to be deleted
that shouldn't have been.
You can use `discourse restore --location=local FILENAME` if you want to restore a backup that is stored locally even though the `backup_location` has the value `s3`.
The 'Discourse SSO' protocol is being rebranded to DiscourseConnect. This should help to reduce confusion when 'SSO' is used in the generic sense.
This commit aims to:
- Rename `sso_` site settings. DiscourseConnect specific ones are prefixed `discourse_connect_`. Generic settings are prefixed `auth_`
- Add (server-side-only) backwards compatibility for the old setting names, with deprecation notices
- Copy `site_settings` database records to the new names
- Rename relevant translation keys
- Update relevant translations
This commit does **not** aim to:
- Rename any Ruby classes or methods. This might be done in a future commit
- Change any URLs. This would break existing integrations
- Make any changes to the protocol. This would break existing integrations
- Change any functionality. Further normalization across DiscourseConnect and other auth methods will be done separately
The risks are:
- There is no backwards compatibility for site settings on the client-side. Accessing auth-related site settings in Javascript is fairly rare, and an error on the client side would not be security-critical.
- If a plugin is monkey-patching parts of the auth process, changes to locale keys could cause broken error messages. This should also be unlikely. The old site setting names remain functional, so security-related overrides will remain working.
A follow-up commit will be made with a post-deploy migration to delete the old `site_settings` rows.
* FEATURE: Import attachments
* FEATURE: Add support for importing multiple forums in one
* FEATURE: Add support for category and tag mapping
* FEATURE: Import groups
* FIX: Add spaces around images
* FEATURE: Custom mapping of user rank to trust levels
* FIX: Do not fail import if it cannot import polls
* FIX: Optimize existing records lookup
Co-authored-by: Gerhard Schlager <mail@gerhard-schlager.at>
Co-authored-by: Jarek Radosz <jradosz@gmail.com>
After running the Discourse merge script, it was pretty evident it held up well after all these years ;)
Made a few fixes:
Included an environment variable for DB_PASS as likely the password will need to be changed if running the import in an official Docker container (recommended)
Set a hard order for imported categories, otherwise sometimes they'd be imported in a weird order making things unpredictable for parent/child category imports
Fixed a couple of instances where we added unique indexes (such as on category slugs)
Set up upload regex to handle AWS URLs better
Fixed the script to work with frozen string literals
This commit adds an additional find_user_by_email hook to ManagedAuthenticator so that GitHub login can continue to support secondary email addresses
The github_user_infos table will be dropped in a follow-up commit.
This is the last core authenticator to be migrated to ManagedAuthenticator 🎉
* ensure emails don't have spaces
* import banned users as suspended for 1k yrs
* upgrade users to TL2 if they have comments
* topic: import views, closed and pinned info
* import messages
* encode vanilla usernames for permalinks. Vanilla usernames can contain spaces and special characters.
* parse Vanilla's new rich body format
Adjustments to the base:
1. PG connection doesn't require host - it was broken on import droplet
2. Drop `topic_reply_count` - it was removed here - https://github.com/discourse/discourse/blob/master/db/post_migrate/20200513185052_drop_topic_reply_count.rb
3. Error with `backtrace.join("\n")` -> `e.backtrace.join("\n")`
4. Correctly link the user and avatar to quote block
Adjustments to vanilla:
1. Top-level Vanilla categories are valid categories
2. Posts have `format` column which should be used to decide if the format is HTML or Markdown
3. Remove no UTF8 characters
4. Remove not supported HTML elements like `font` `span` `sub` `u`
- Stream the queries that load the imported_ids
- Use an array instead of a hash for keeping the mapping between imported_ids and new ids
- Ensure we always treat the imported_ids as integers instead of strings
Fixed bugs, added specs, extracted the upload downsizing code to a class, added support for non-S3 setups, changed it so that images aren't downloaded twice.
This code has been tested on production and successfully resized ~180k uploads.
Includes:
* DEV: Extract upload downsizing logic
* DEV: Add support for non-S3 uploads
* DEV: Process only images uploaded by users
* FIX: Incorrect usage of `count` and `exist?` typo
* DEV: Spec S3 image downsizing
* DEV: Avoid downloading images twice
* DEV: Update filesizes earlier in the process
* DEV: Return false on invalid upload
* FIX: Download images that currently above the limit (If the image size limit is decreased, then there was no way to resize those images that now fall outside the allowed size range)
* Update script/downsize_uploads.rb (Co-authored-by: Régis Hanol <regis@hanol.fr>)
Refactors script to follow conventions of other importers and adds some features including like import, processing of post raw text, and, if needed, SSO import.
FEATURE: new rake task to update first_post_created_at column
The not-equal operator (`<>`) in PostgreSQL does not compare values
with NULL. We should instead use `IS DISTINCT FROM` when comparing
values with NULL.
Moves the most important checks into a linter. It gets executed by Lefthook as well as the docker rake task and Github actions. Doing those checks in rspec takes too long and it produces errors when the discourse:test Docker image contains old, invalid locale files.
`Facter.reset` (65d167eac9/lib/facter.rb (L126-L137)) clears `Facter::Options[:external_dir]` which seems to be the 4.x equivalent of `Facter::Util::Config.external_facts_dirs`.
This commit also makes sure that version 4.0 or higher is installed.
Correctly handles more upload formats in posts, updates post custom fields, fixes more edge cases, adds debugging capabilities. (VERBOSE=1 and INTERACTIVE=1 flags)
Includes these commits and some more:
* DEV: Show the fixed image dimensions
* FIX: Support more upload url formats
* DEV: Remove the old upload after updating posts
* FIX: Use the `process_post_#{id}` mutex
* FIX: Avoid rebaking twice
* DEV: Print out the link to the post
* DEV: Process posts chronologically
* DEV: Do a dry-run before saving, pause on any issue
* FIX: Also process deleted posts
* DEV: Make matchers case-insensitive
* DEV: Pause on "detached" uploads, add more debug info
* DEV: Print out time when finished
* DEV: Add support for WORKER_ID/WORKER_COUNT
* DEV: Fix the onebox in cooked text heuristic
* DEV: Don't report already processed posts
* DEV: Beep when done!
* DEV: Ignore issues with deleted posts
* DEV: Ignore issues with deleted topics
* DEV: Multiline SQL
* DEV: Use the bulk attribute assignment
* DEV: Add ENV["INTERACTIVE"] mode
* DEV: Handle post custom fields
* DEV: Bail on non-S3 sites
* DEV: Allow sizes smaller than 1 mpix
* PERF: Dematerialize topic_reply_count
It's only ever used for trust level promotions that run daily, or compared to 0. We don't need to track it on every post creation.
* UX: Add symbol in TL3 report if topic reply count is capped
* DEV: Drop user_stats.topic_reply_count column
* DEV: api documentation updates
- Created a script to convert json responses to rswag
- Documented several api endpoints
- Switched rswag to use header based auth
* Update script, fix some schema missmatches
We have the `# frozen_string_literal: true` comment on all our
files. This means all string literals are frozen. There is no need
to call #freeze on any literals.
For files with `# frozen_string_literal: true`
```
puts %w{a b}[0].frozen?
=> true
puts "hi".frozen?
=> true
puts "a #{1} b".frozen?
=> true
puts ("a " + "b").frozen?
=> false
puts (-("a " + "b")).frozen?
=> true
```
For more details see: https://samsaffron.com/archive/2018/02/16/reducing-string-duplication-in-ruby
* DEV: Update the working tree just once.
`git pull` was effectively doing `git fetch` and `git merge FETCH_HEAD`, and only then we were checking out the desired branch/commit. This change will skip the the merge step.
* DEV: Don't run lefthook in docker_test
I'm not clear why changing only the `wait_for_url` address was necessary and not also the `get` a few lines above, but this change seems to work for me on both literatecomputing.com Groups and a public group.
Checking if all records have been imported uses a temp table in PostgreSQL. This fails when pgbouncer is used unless the temp table is created inside a transaction.
* Detects mostly all attachments and it's a lot faster
* Parses user properties in Ruby instead of the DB, because that's less errorprone
* Imports user avatars
* Imports topic views by users
* Better handling of quotes and YouTube links
* Adds ability to map forums to categories and tags as well as ignore forums.
* Fixes regular expression for detecting attachments in posts.
* Handles "remote attachments" 😮 by inserting a link.
* Imports view counts for topics.
* Handles incorrect references of parent posts.
* Better handling of quotes.
* Finds a lot more attachments by trying to replace various Unicode characters in filenames.
* Customizable email subject prefixes to remove "Re" and "Fwd" as well as localized prefixes.
* Configuration option for prefixes like [FOO] or (BAR) which can be replaced with tags during import.
* Bugfix: Import script might have skipped some users due to missing ORDER BY.
Posts without a user probably shouldn't happen unless there was some direct database tampering, but data like that has been seen in the wild.
The importer will assign those posts to the "system" user.
Previously we had many places in the app that called `hostname` to get
hostname of a server. This commit replaces the pattern in 2 ways
1. We cache the result in `Discourse.os_hostname` so it is only ever called once
2. We prefer to use Socket.gethostname which avoids making a shell command
This improves performance as we are not spawning hostname processes throughout
the app lifetime
This is not used in core or official plugins, and has been printing a deprecation notice since v2.3.0beta4. All OpenID 2.0 code and dependencies have been dropped. The user_open_ids table remains for now, in case anyone has missed the deprecation notice, and needs to migrate their data.
Context at https://meta.discourse.org/t/-/113249
With this change the script:
* Actually removes original large-sized images
* Doesn't save processed files if their size has increased
* Prevents inconsistent state
We pick the first topic with 30 responses as our bench topic.
Previously we simply picked the last topic, but hand no guarantee on ordering.
This also attempts to correct previous runs of the bench.
The following methods have long been deprecated in ruby due to flaws in their implementation per http://blade.nagaokaut.ac.jp/cgi-bin/vframe.rb/ruby/ruby-core/29293?29179-31097:
URI.escape
URI.unescape
URI.encode
URI.unencode
escape/encode are just aliases for one another. This PR uses the Addressable gem to replace these methods with its own encode, unencode, and encode_component methods where appropriate.
I have put all references to Addressable::URI here into the UrlHelper to keep them corralled in one place to make changes to this implementation easier.
Addressable is now also an explicit gem dependency.
We like to stay as close as possible to latest with rubocop cause the cops
get better.
This update required some code changes, specifically the default is to avoid
explicit returns where implicit is done
Also this renames a few rules
This is a bottom up rewrite of Discourse cache to support faster performance
and a limited surface area.
ActiveSupport::Cache::Store accepts many options we do not use, this partial
implementation only picks the bits out that we do use and want to support.
Additionally params are named which avoids typos such as "expires_at" vs "expires_in"
This also moves a few spots in Discourse to use Discourse.cache over setex
Performance of setex and Discourse.cache.write is similar.
Discourse.cache is a more consistent method to use and offers clean fallback
if you are skipping redis
This is part of a larger change that both optimizes Discoruse.cache and omits
use of setex on $redis in favor of consistently using discourse cache
Bench does reveal that use of Rails.cache and Discourse.cache is 1.25x slower
than redis.setex / get so a re-implementation will follow prior to porting
`FileUtils.cd` and `Dir.chdir` cause the working directory to change for the entire process. We run sidekiq jobs, hijacked requests and deferred jobs in threads, which can make working directory changes have unintended side-effects.
- Add a rubocop rule to warn about usage of Dir.chdir and FileUtils.cd
- Added rubocop:disable for scripts used outside the app
- Refactored code using cd to use alternative methods
- Temporarily skipped the rubocop check for lib/backup_restore. This will require more complex refactoring, so I will create a separate PR for review
Doing .pluck(:column).first is a very common pattern in Discourse and in
most cases, a limit cause isn't being added. Instead of adding a limit
clause to all these callsites, this commit adds two new methods to
ActiveRecord::Relation:
pluck_first, equivalent to limit(1).pluck(*columns).first
and pluck_first! which, like other finder methods, raises an exception
when no record is found
The script will now correct all width/height and thumbnail_width/thumbnail_height properties of all the uploaded images.
The script now uses width * height to filter out all unaffected images.
Also handled the case where a downsized image was already an uploaded record.
These scripts are somewhat rough but I needed them to help debug a memory
leak we have noticed in rails 6.
The biggest object script finds all the biggest objects we have in memory
after boot.
The test memory leak runs a very simple iteration through all multisites
and observed memory.
I introduced DemonBase because I had got some conflict between `demon/base.rb` and `jobs/base.rb`, however, to not rename base class, it is possible to use regex on absolute path in Zeitwerk custom inflector.
Zeitwerk simplifies working with dependencies in dev and makes it easier reloading class chains.
We no longer need to use Rails "require_dependency" anywhere and instead can just use standard
Ruby patterns to require files.
This is a far reaching change and we expect some followups here.
Trying to automate the login into a Google account is quite hard. This makes the crawler use the content of a cookies.txt file instead. It also removes a couple of deprecation warnings and adds some color to the output.
This is a temporary workaround for the issue in https://github.com/rails/rails/pull/36949
Discussing a proper fix in Rails with the Rails team.
Prior to this fix we were spinning up a thread every time we closed a connection
to the db.
* REFACTOR: Rename SiteSetting.disable_edit_notifications to disable_system_edit_notifications
- The older name could cause some confusion because the setting does not disable all edit notifications, only system ones.
* FIX: Add frozen_string_literal: true in the migration
* DEV: Deprecate 'disable_edit_notifications'
This also corrects FileHelper.download so it supports "follow_redirect"
correctly (it used to always follow 1 redirect) and adds a `validate_url`
param that will bypass all uri validation if set to false (default is true)
* FEATURE: Add attachment support to XenForo importer
If `ATTACHMENT_DIR` is provided, importer will scan each imported post
for `[GALLERY]` and `[ATTACH]` tags, attempt to import the referenced files
as Discourse uploads and replace the tags with Discourse markup.
References to files which cannot be imported are stripped.
NOTE: This only imports attachments which are referenced in imported
posts. Any XenForo media or files which are not referenced in any post
using `[ATTACH]` or `[GALLERY]` tags will not be imported. The goal is to
ensure that we don't have posts with missing images and unsightly
markup, NOT to ensure that all attachments are migrated.
* FEATURE: Add attachment support to XenForo importer
If `ATTACHMENT_DIR` is provided, importer will scan each imported post
for `[GALLERY]` and `[ATTACH]` tags, attempt to import the referenced files
as Discourse uploads and replace the tags with Discourse markup.
References to files which cannot be imported are stripped.
NOTE: This only imports attachments which are referenced in imported
posts. Any XenForo media or files which are not referenced in any post
using `[ATTACH]` or `[GALLERY]` tags will not be imported. The goal is to
ensure that we don't have posts with missing images and unsightly
markup, NOT to ensure that all attachments are migrated.
* FEATURE: Add attachment support to XenForo importer
If `ATTACHMENT_DIR` is provided, importer will scan each imported post
for `[GALLERY]` and `[ATTACH]` tags, attempt to import the referenced files
as Discourse uploads and replace the tags with Discourse markup.
References to files which cannot be imported are stripped.
NOTE: This only imports attachments which are referenced in imported
posts. Any XenForo media or files which are not referenced in any post
using `[ATTACH]` or `[GALLERY]` tags will not be imported. The goal is to
ensure that we don't have posts with missing images and unsightly
markup, NOT to ensure that all attachments are migrated.
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.
Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
This removes all uses of both `send` and `public_send` from consumers of
SiteSetting and instead introduces a `get` helper for dynamic lookup
This leads to much cleaner and safer code long term as we are always explicit
to test that a site setting is really there before sending an arbitrary
string to the class
It also removes a couple of risky stubs from the auth provider test
`Upload#url` is more likely and can change from time to time. When it
does changes, we don't want to have to look through multiple tables to
ensure that the URLs are all up to date. Instead, we simply associate
uploads properly to `UserProfile` so that it does not have to replicate
the URLs in the table.