It would be great to have a public archive of the posts here, so that
- if this site goes away, we still have all the content, and
- it is searchable by indexers like https://bitcoinsearch.xyz.
As far as I can tell, there are three approaches:
Use some kind of crawler (selenium, wget, etc.) to manually create a backup by scraping the site.
- …but, it can be done by anyone, with no permissions necessary.
Use the Discourse API to pull content. This could be done either periodically as a full crawl or incrementally on a continuous basis.
- I think this is a fine way to go assuming the API offers everything we need and someone can provision API credentials.
- We may be able to get everything we need out of the list posts endpoint.
Get a SQL dump of the database and run some kind of sanitization/export script on it every so often.
- The upside: easy to be complete and the schema is probably very stable.
- The downside: this can only be performed by administrators.
I’d say we should either go with API or DB, and then post the results on a git repo somewhere public (preferably a few places!).
Thoughts from the admins?
Scraping seems to be what upstream recommends:
There’s also the /raw/ endpoint: https://delvingbitcoin.org/raw/87 and the API, e.g. https://delvingbitcoin.org/posts.json
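To make the API option a bit more concrete, here is a rough sketch of walking /posts.json backwards through the whole site. It rests on assumptions I have not checked against this instance: that the response has the shape {"latest_posts": [...]}, and that a `before` query parameter returns only posts with a lower id. The function names and the injectable `fetch` argument are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://delvingbitcoin.org"  # the site discussed in this thread


def posts_url(base_url, before=None):
    """Build the URL for one batch of /posts.json, optionally older than `before`."""
    url = base_url.rstrip("/") + "/posts.json"
    if before is not None:
        # Assumption: `before` means "only posts with an id lower than this".
        url += "?" + urllib.parse.urlencode({"before": before})
    return url


def parse_batch(payload):
    """Return (posts, next_before) for one decoded response.

    Assumes the payload looks like {"latest_posts": [{"id": ...}, ...]}.
    next_before is None when the batch is empty.
    """
    posts = payload.get("latest_posts", [])
    if not posts:
        return [], None
    return posts, min(p["id"] for p in posts)


def crawl(base_url=BASE_URL, fetch=None):
    """Yield posts newest-first until the server returns an empty batch."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    before = None
    while True:
        posts, next_before = parse_batch(fetch(posts_url(base_url, before)))
        # Stop on an empty batch, or if the server seems to ignore `before`.
        if not posts or (before is not None and next_before >= before):
            return
        yield from posts
        before = next_before
```

Something like `for post in crawl(): ...` would then do a full crawl; an incremental run could stop as soon as it reaches a post id that is already archived.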
There’s also the “Discourse Public Data Dump” topic on Discourse Meta, because of course exporting data for AI training is a much higher priority than just information accessibility and continuity…
Oh, that’s great! I think these two might give us everything we need?
Might need to do something extra to also get any uploaded images / attachments?
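For the images and attachments, one possible approach is to scan each post's rendered HTML for upload links and download those separately. The sketch below assumes the post JSON carries the rendered HTML in a `cooked` field, that images appear as `<img src=...>`, and that attachments are links with an `attachment` class; I have not verified any of that here, and the names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class UploadFinder(HTMLParser):
    """Collect absolute URLs of images and attachment links in a post's HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.urls.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            # Assumption: attachment links carry class="attachment".
            if "attachment" in (attrs.get("class") or "").split():
                self.urls.append(urljoin(self.base_url, attrs["href"]))


def find_uploads(cooked_html, base_url):
    """Return upload URLs found in one post's rendered HTML, in document order."""
    finder = UploadFinder(base_url)
    finder.feed(cooked_html)
    finder.close()
    return finder.urls
```

This would also pick up things like emoji images, so a real archiver would probably want to filter by path before downloading.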
Also need to be prepared to add ?page=2 etc. if there are more than 100 comments, e.g. https://meta.discourse.org/raw/69776?page=8. I think the API is rate limited fairly heavily by default (see “Available settings for global rate limits and throttling” on Discourse Meta), so it may need some tweaking.
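A rough sketch of paging through /raw/ while backing off when rate limited might look like this. How the server signals "no more pages" is a guess on my part (an empty body or a 404 is treated as the end, and a repeated page is also treated as the end), and the retry count and sleep times are arbitrary placeholders.

```python
import time
import urllib.error
import urllib.request


def raw_url(base_url, topic_id, page=1):
    """URL of one page of a topic's raw text; page 1 needs no query string."""
    url = f"{base_url.rstrip('/')}/raw/{topic_id}"
    return url if page == 1 else f"{url}?page={page}"


def http_get(url, retries=5):
    """Fetch a URL as text, sleeping and retrying on HTTP 429."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return ""  # assumption: a missing page means we ran off the end
            if err.code != 429 or attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
    return ""


def fetch_raw_topic(base_url, topic_id, get=http_get, max_pages=1000):
    """Collect the raw pages of a topic until an empty or repeated page."""
    pages = []
    for page in range(1, max_pages + 1):
        text = get(raw_url(base_url, topic_id, page))
        if not text.strip() or (pages and text == pages[-1]):
            break
        pages.append(text)
    return pages
```

For example, `fetch_raw_topic("https://delvingbitcoin.org", 87)` would return a list with one string per page of that topic.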
Having considered at length the long-term utility of logs and access to information that Bitcoin requires to defend itself (transparency is literally one of its best defences, and the one that has most often come to the rescue), I would also like to recommend that you make archival, and thus witnessing, something that more than one person or process can perform.
On IRC, people can create and maintain logs longer-term on a per-person basis. This creates a much more robust and participation-based consensus on what constitutes the historical record.
I have found that this is an important facet of archival effectiveness as well. Attacker-injected content can occasionally threaten the safety of individuals, but I have also found that the most diligent and reliable sources of archival information tend to be the most reasonable and practical people, so this tends to be an addressable problem.
Thus, may I suggest that the archival process itself be made available to individuals who are interested in participating.
Further, now that I’m thinking about it, I would like to point out that forums like Slack, where the relevant historical archive is spotty, questionable, inaccessible, or otherwise opaque, are sources of significant and ongoing attacker fuel. A simple propaganda-only example would be the ongoing nonsense about the “dragon’s den.”
Okay! I’ve come up with a minimum viable script for doing this.
Here’s the archive repository: GitHub - jamesob/delving-bitcoin-archive: A public archive of delvingbitcoin.org.
The script for doing the actual archiving is here (and should be easily pip-installable by anyone wanting to reproduce this process for themselves): GitHub - jamesob/discourse-archive: Provides a simple archive of Discourse content
pip install discourse-archive
discourse-archive -u https://delvingbitcoin.org
The only caveat with this script is that if someone updates an old post (older than a day), the update won’t be detected. I’m not sure if there’s a good solution for this, but maybe some other part of the API could clue us in to updated posts.
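One hedged idea for the edit problem: do an occasional full re-crawl and compare each post's `updated_at` against what was recorded last time, rewriting only the posts that changed. This assumes each post in the API response carries an `updated_at` field. It does not avoid re-fetching old posts, it only makes it cheap to decide which archived files to touch. The function names and the index layout are hypothetical and are not how discourse-archive currently works.

```python
def changed_posts(fetched, index):
    """Return posts that are new or whose `updated_at` differs from the index.

    fetched: list of post dicts with "id" and "updated_at" keys (assumed fields).
    index:   dict mapping post id -> `updated_at` recorded at the last archive run.
    """
    return [p for p in fetched if index.get(p["id"]) != p["updated_at"]]


def update_index(index, posts):
    """Record the latest `updated_at` for each post that was just archived."""
    for p in posts:
        index[p["id"]] = p["updated_at"]
    return index
```

The index could live as a small JSON file next to the archived posts in the git repo, so anyone reproducing the archive gets the same change detection.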