Public archive for Delving Bitcoin

jamesob · September 5, 2023, 5:42pm

It would be great to have a public archive of the posts here, so that (1) if this site goes away, we still have all the content, and (2) it is searchable by indexers like https://bitcoinsearch.xyz.

As far as I can tell, there are three approaches:

Scraping

Use some kind of crawler (selenium, wget, etc.) to manually create a backup by scraping the site.

I think this is not great given how JavaScript-heavy Discourse is (i.e. it loads posts incrementally based on cursor position). It’s also brittle: scraping code is often complicated.
…but, it can be done by anyone, with no permissions necessary.

API

Use the Discourse API to pull content. This could be done either periodically as a full crawl or incrementally on a continuous basis.

I think this is a fine way to go assuming the API offers everything we need and someone can provision API credentials.
We may be able to get everything we need out of the list posts endpoint.
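
A rough sketch of what a full crawl through the list-posts endpoint might look like. It assumes the endpoint is `/posts.json`, that the response carries a `latest_posts` array of post objects with an `id` field, and that a `before` query parameter pages backward through older posts; none of that is confirmed here, so it should be checked against the Discourse API docs. `fetch_json` is a hypothetical helper supplied by the caller (it would wrap whatever HTTP client and API credentials are in use).

```python
from typing import Callable, Dict, Iterator, List, Optional


def crawl_posts(
    base_url: str,
    fetch_json: Callable[[str], Dict],
    max_requests: int = 1000,
) -> Iterator[Dict]:
    """Yield posts newest-first by paging backward through /posts.json.

    Assumptions (not confirmed in the thread): the response has a
    'latest_posts' list, and '?before=<id>' returns posts older than <id>.
    """
    before: Optional[int] = None
    for _ in range(max_requests):
        url = f"{base_url.rstrip('/')}/posts.json"
        if before is not None:
            url += f"?before={before}"
        posts: List[Dict] = fetch_json(url).get("latest_posts", [])
        # Drop anything already seen, in case 'before' turns out to be inclusive.
        new_posts = [p for p in posts if before is None or p["id"] < before]
        if not new_posts:
            return  # nothing older left
        yield from new_posts
        oldest = min(p["id"] for p in new_posts)
        if oldest <= 1:
            return
        before = oldest
```

The `max_requests` cap is only a safety net so a misbehaving endpoint cannot keep the loop running forever.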

DB dump

Get a SQL dump of the database and run some kind of sanitization/export script on it every so often.

The upside: easy to be complete and the schema is probably very stable.
The downside: this can only be performed by administrators.

I’d say we should either go with API or DB, and then post the results on a git repo somewhere public (preferably a few places!).

Thoughts from the admins?

ajtowns · September 5, 2023, 6:12pm

Scraping seems to be what upstream recommends:

There’s also the /raw/ endpoint: https://delvingbitcoin.org/raw/87 and the API, e.g. https://delvingbitcoin.org/posts.json

There’s also the “Discourse Public Data Dump” topic (developers category, Discourse Meta), because of course exporting data for AI training is a much higher priority than just information accessibility and continuity…

jamesob · September 5, 2023, 6:45pm

Oh, that’s great! I think these two might give us everything we need?

ajtowns · September 6, 2023, 1:38am

Might need to do something extra to also get any uploaded images / attachments?

Also need to be prepared to add ?page=2 etc. if there are more than 100 comments, e.g. https://meta.discourse.org/raw/69776?page=8. I think the API is rate limited kinda heavily by default (see “Available settings for global rate limits and throttling” in the sysadmin category on Discourse Meta), so it may need some tweaking.
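
A minimal sketch of how the paging and rate limiting might be handled together. It assumes that `?page=1` behaves like the unpaged URL, that a page past the end of a topic comes back with a non-200 status or an empty body, and that a rate-limited request is signalled with HTTP 429; all three should be verified against a real instance. `fetch` is a hypothetical helper returning a `(status_code, body)` pair.

```python
import time
from typing import Callable, List, Tuple


def fetch_raw_topic(
    base_url: str,
    topic_id: int,
    fetch: Callable[[str], Tuple[int, str]],
    max_pages: int = 100,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Collect the raw text of each page of a topic from /raw/<id>?page=N.

    Assumption: a page past the end returns a non-200 status or an empty body.
    """
    pages: List[str] = []
    for page in range(1, max_pages + 1):
        url = f"{base_url.rstrip('/')}/raw/{topic_id}?page={page}"
        delay = 1.0
        for _ in range(max_retries):
            status, body = fetch(url)
            if status != 429:
                break
            sleep(delay)  # rate limited: wait, then retry with a longer delay
            delay *= 2
        else:
            raise RuntimeError(f"still rate limited after {max_retries} tries: {url}")
        if status != 200 or not body.strip():
            break  # treat this as the end of the topic
        pages.append(body)
    return pages
```

The doubling delay is a generic backoff choice, not something taken from the Discourse settings; the right values depend on how the instance is configured.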

midnight · September 6, 2023, 5:49pm

Having considered at length the long-term utility of logs and access to information that Bitcoin requires to defend itself (transparency is literally one of its best defences, and the one that has most often resulted in rescues), I would also like to recommend that you make the archival, and thus the witnessing, something that more than one person or process can perform.

On IRC, people can create and maintain logs longer-term on a per-person basis. This creates a much more robust and participation-based consensus on what constitutes the historical record.

I have found that this is an important facet of archival effectiveness as well. There are occasionally attacker-injection problems that can threaten the safety of individuals, but I have also found that the most diligent and reliable sources of archival information tend to be the most reasonable and practical people as well, so this tends to be an addressable problem.

Thus, may I suggest that the archival process itself be made available to individuals who are interested in participating.

Further, now that I’m thinking about it, I would like to point out that for those forums like Slack where the relevant historical archive is spotty, questionable, inaccessible, or otherwise opaque, these places are the sources of significant and ongoing attacker fuel—a simple propaganda-only example would be the ongoing nonsense about the “dragon’s den.”

jamesob · September 7, 2023, 3:53pm

Okay! I’ve come up with a minimum viable script for doing this.

Here’s the archive repository: https://github.com/jamesob/delving-bitcoin-archive (a public archive of delvingbitcoin.org).

The most interesting contents are probably the rendered markdown listing of topic threads: https://github.com/jamesob/delving-bitcoin-archive/tree/master/archive/rendered-topics.
But I’ve also included a raw archive of the post JSON, which should be useful for search indexers: https://github.com/jamesob/delving-bitcoin-archive/tree/master/archive/posts.

The script for doing the actual archiving is in the jamesob/discourse-archive repository on GitHub (“Provides a simple archive of Discourse content”), and should be easily pip-installable by anyone wanting to reproduce this process for themselves:

pip install discourse-archive
discourse-archive -u https://delvingbitcoin.org

The only caveat with this script is if someone updates an old post (older than a day), the update won’t be detected. I’m not sure if there’s a good solution for this, but maybe some other part of the API could clue us in to updated posts.
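
One possible direction, sketched under the assumption that each post object in the API response carries an `updated_at` field alongside its `id` (not confirmed here): keep the last-seen `updated_at` for every archived post and compare it against a fresh listing. This is not how discourse-archive currently works, and the function name is made up for illustration.

```python
from typing import Dict, Iterable, List


def find_changed_posts(
    archived: Dict[int, str],
    fresh_posts: Iterable[Dict],
) -> List[int]:
    """Return ids of posts that are new or whose 'updated_at' differs.

    `archived` maps post id -> the 'updated_at' string saved last time.
    Assumption: each fresh post object has 'id' and 'updated_at' fields.
    """
    changed: List[int] = []
    for post in fresh_posts:
        if archived.get(post["id"]) != post.get("updated_at"):
            changed.append(post["id"])
    return changed
```

This only detects edits among posts that get re-listed, so it would still need an occasional full re-crawl, which has to be weighed against the rate limits.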