Experience report: nothing much is working, every useful action is failing or crashing #237

Closed
opened 2022-02-12 05:13:49 +00:00 by chris-morgan · 4 comments
chris-morgan commented 2022-02-12 05:13:49 +00:00 (Migrated from gitlab.com)

Here is a report from trying to set this up on chrismorgan.info.

Latest everything: conduit next (0cec421); Arch Linux, built with the PKGBUILD from https://github.com/S7evinK/conduit-archlinux/pull/1; current nightly rustc. Server has about 7GB out of 20GB disk space free, about 750MB out of 1GB RAM free, and one core, which sits close to idle (typical load factor <0.05). Straightforward nginx reverse proxy on 443 and 8448.

  1. I installed and started it.
  2. I went to app.element.io and created an account, @chris-morgan:chrismorgan.info. It reported a CORS error. I tried creating the account again, but it seemed that it had actually succeeded in creating the account. I logged in.
  3. I set allow_registration = false in my config file and restarted Conduit.
  4. I went to the user settings and went to enter an email address, and got another CORS error. See #188; in the short term it might be better to emit a CORS-ready error, hopefully Element could then say “not supported” or similar rather than “CORS error”. Similar on phone number.
  5. At some point I installed and switched to the desktop app, in the hope CORS stuff would stop. It didn’t. But the whole verification flow with its emoji and such worked properly.
  6. I tried to look at other rooms, but it said federation was disabled. I looked in perplexity at my config file, because it has the two adjacent lines # allow_encryption = false and # allow_federation = false, with the text before the first roughly saying that its default is true, and so I assumed that this meant that allow_federation also defaulted to true. Eventually I looked at the code, and no, federation is indeed false by default. Recommendation: where there are commented-out values in the sample config file, they should either clearly match the defaults, or clearly be different from the defaults (probably the former—though it’s risky in case of changing defaults, so that there can also be merit in just spelling everything out). Also a line break is needed between these two because they’re completely unrelated. And later in the file on proxy it refers to a source file which you can’t reasonably assume the reader has, and the path is wrong too. And I wish you didn’t have a bunch of slightly-divergent sample config files (conduit-example.toml, DEPLOY.md, debian/postinst; also other config stuff, and systemctl units, with DEPLOY.md duplicating yet deviating from debian/matrix-conduit.service). Anyway, I successfully enabled federation. Incidentally I would also note from looking at config.rs that it doesn’t include everything, even comparatively normal config like allow_room_creation.
  7. So far, I have not succeeded in joining a room. I tried previewing #rust:matrix.org, and gave up after ten minutes or so, observing no obvious activity of the server process via htop. I tried joining it too, and it failed. I think it crashed the service (status=4/ILL), with no output in the journal—certainly I’ve got quite a few crashes from various actions shown in the journal. I added conduit.rs to the list of servers; it does show all the rooms available, but as for previewing or joining #conduit:fachschaften.org, similarly no luck. I have attempted things like this multiple times, including long waits.
  8. I tried creating a room on my own server, and Conduit crashed. Again, each time it was SIGILL with nothing else in the journal.
  9. In the journal I observe an “Admin room must exist” panic from admin.rs:66 every time the server starts. I have no idea if this could be contributing to all the other failures. (Incidentally, nearby is .expect("#admins:server_name is a valid room alias") which is worded back to front: the argument to expect should not be the failed expectation, but the problem, which would be that it was not a valid room alias.)
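
To make the `expect` wording point in item 9 concrete, here is a minimal sketch of how the two phrasings read once the panic actually fires. `parse_room_alias`, `InvalidAlias` and `panic_message` are hypothetical stand-ins written for this illustration, not Conduit's or ruma's API; only `Result::expect` and `std::panic::catch_unwind` are real standard-library calls.

```rust
use std::panic;

/// Hypothetical error type, used only for this illustration.
#[derive(Debug)]
struct InvalidAlias;

/// Hypothetical stand-in for a room-alias parser: accepts "#localpart:server".
fn parse_room_alias(s: &str) -> Result<&str, InvalidAlias> {
    match s.strip_prefix('#') {
        Some(rest) if rest.contains(':') => Ok(s),
        _ => Err(InvalidAlias),
    }
}

/// Runs `f` and returns the panic message if it panicked, `None` otherwise.
fn panic_message<F: FnOnce() + panic::UnwindSafe>(f: F) -> Option<String> {
    panic::catch_unwind(f).err().map(|payload| {
        payload
            .downcast_ref::<String>()
            .cloned()
            .or_else(|| payload.downcast_ref::<&str>().map(|s| s.to_string()))
            .unwrap_or_default()
    })
}

fn main() {
    // Wording that states the expectation: the message asserts validity
    // right next to the error that proves otherwise.
    let as_expectation = panic_message(|| {
        parse_room_alias("admins").expect("#admins:server_name is a valid room alias");
    });
    // Message: "#admins:server_name is a valid room alias: InvalidAlias"
    println!("{as_expectation:?}");

    // Wording that states the problem.
    let as_problem = panic_message(|| {
        parse_room_alias("admins").expect("failed to parse #admins:server_name as a room alias");
    });
    // Message: "failed to parse #admins:server_name as a room alias: InvalidAlias"
    println!("{as_problem:?}");
}
```

`expect` formats its panic as the message, a colon, then the `Debug` form of the error, so whichever wording is chosen is what ends up in the journal next to the error value.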

I’m not giving up on Conduit, but would like to get this working, and so am happy to work further with you. Given that nothing useful is actually working yet, I’m quite content to scrap the db and start from scratch at this stage.

chris-morgan commented 2022-02-12 08:34:56 +00:00 (Migrated from gitlab.com)

Hmm, I just figured I’d try it in a clean location, run directly rather than via systemd.

Turns out that what I built crashes on the first execution, presumably at some point in database initialisation, but without debug symbols I don’t know how to get anything useful out of it. I guess that could explain why there was no “Created new {} database with version {}” message in the journal.

I tried swapping from rocksdb to sqlite, same crash.

Tried running it on my laptop (which is where I built the package because it’s at least 20× as fast, probably more like 100×, for such purposes), and it didn’t crash.

The plot thickens. Building it on the server now. If it succeeds, I suppose something is doing compile-time CPU feature detection, which is a very bad idea. If it doesn’t, perhaps something is depending on too-recent CPU features and should be relaxed. (My server is with Vultr; /proc/cpuinfo says vendor_id GenuineIntel, family 6, model 61, model name Virtual CPU a7769a6388d5.) Or else I’m not sure yet, perhaps if it gets to that I can build it locally with debug symbols and see where it’s actually crashing.
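
For anyone wanting to check this hypothesis on their own machine, here is a rough sketch of the difference between what a binary was compiled to assume and what the CPU it runs on reports. AVX2 is only an example feature; nothing in this thread identifies which instruction actually triggered the SIGILL. The function names are made up for illustration; `cfg!(target_feature = ...)` and `std::is_x86_feature_detected!` are the real standard facilities.

```rust
/// True if the compiler was allowed to assume AVX2 when building this binary
/// (for example via `-C target-cpu=native` on a machine that has it).
fn built_assuming_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// True if the CPU this binary is running on reports AVX2 at run time.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_reports_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On non-x86 targets there is no AVX2 to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_reports_avx2() -> bool {
    false
}

/// A binary built to assume a feature the CPU lacks can hit an
/// illegal-instruction fault as soon as such an instruction executes.
fn would_sigill(built_with: bool, cpu_has: bool) -> bool {
    built_with && !cpu_has
}

fn main() {
    let built_with = built_assuming_avx2();
    let cpu_has = cpu_reports_avx2();
    println!("built assuming AVX2: {built_with}, CPU reports AVX2: {cpu_has}");
    if would_sigill(built_with, cpu_has) {
        eprintln!("this build assumes a CPU feature this machine does not report");
    }
}
```

This only covers the Rust side. A C or C++ dependency that picks its instruction set from the build machine would not show up here, so a clean result does not rule out the hypothesis.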

So: the worst of my initial report here is probably all caused by one specific issue, though there are a few independent points.

I’ll post more in a year or three when the server finishes compiling and I’ve tried it out. (OK, so it’s actually almost up to linking, because I did half the build earlier before switching to my laptop, and I started the remainder probably only 20 or 30 minutes ago.)

timokoesters commented 2022-02-12 08:35:47 +00:00 (Migrated from gitlab.com)
  2. These CORS errors are weird. Either your nginx does not forward all requests (OPTIONS requests) or it strips away the headers that Conduit sets.
  4. Email addresses (identity servers) are not supported yet. The CORS error is misleading.
  6. Yes, the config files all seem to be different. We really need to fix this. conduit-example.toml is the correct config file; all others should look like that. Also see https://gitlab.com/famedly/conduit/-/merge_requests/283
  7. This is very odd. You should probably delete the database and try again, but I can't explain why it broke.
  9. The admin room error implies that Conduit crashed on the first start, before it was able to create the admin room.

I'm sorry for your horrible experience; I hope we can solve all these problems.

chris-morgan commented 2022-02-12 08:56:03 +00:00 (Migrated from gitlab.com)

OK, the build on the server finished, and it’s reaching “Created new rocksdb database with version 11” without a crash. This suggests to me that something is doing compile-time CPU feature detection, which is very undesirable as a default (though it can be a useful optional feature).

See you in #conduit:fachschaften.org soon, provided everything else goes smoothly!

timokoesters commented 2022-04-04 19:33:46 +00:00 (Migrated from gitlab.com)

What's the status of your server? Can I close this issue?

Reference: Matthias/conduit#237