Ruqqus public dataset

A subverse to post things for posterity.

Moderators: MadWorld, kestrel9

MadWorld
Posts: 1229
Joined: Wed Dec 23, 2020 2:00 am
Topic points (SCP): 1276
Reply points (CCP): 2987

Re: Ruqqus public dataset

Post by MadWorld »

CrystalVulpine wrote: Tue Dec 14, 2021 1:09 am
MadWorld wrote: Tue Dec 14, 2021 12:01 am
CrystalVulpine wrote: Mon Dec 13, 2021 11:12 pm

@MadWorld Thank you so much! I'll definitely use this.

So you got the initially uploaded version by crawling the HTML I assume? I guess you said the API didn't show everything, that's strange.
Yes, it was converted from HTML. It was started after @antiliberalsociety expressed his interest in preserving the data. The crawled version included the pre-purge version of the data.

After ruqqus went into "read-only" mode, there were inconsistencies and large gaps in the pagination. The comments were fetched in parts so that the parts overlapped the gaps. The submissions had fewer pagination inconsistencies. But at least one guild that I was aware of (+general) was not available through the API. We can expect the API data to be incomplete because of the changes the admins had made.
@MadWorld I have 13 posts in +general in the API version, so it must've been available. But you're right that it's missing some data, because I only have 498 posts in the API data whereas the HTML-crawled data includes 528, meaning the API excluded 30 of them. Maybe that's from posting in banned guilds, but I don't remember posting in banned guilds that often.

I have to wonder if they created the pagination gaps on purpose to destroy as much data as possible, since they also shut down way earlier than they said they would.

Update: Since each comment includes the original post within itself, most of the missing posts could be copied from there. So the only ones that have to be reconstructed from the HTML crawl are posts that were skipped by the API and have no comments (or were purged).

Update 2: I recovered 23 extra posts from the comments. 21 of them were in +general, the 2 that weren't were both in +FreeForum, and one of those 2 was originally in +HiddenWebGems but kicked to +general and yanked to +FreeForum. So it appears that +general was accessible from the API, but extremely spotty (and maybe +FreeForum too, but it's only 2 posts so it could be a coincidence). 7 posts are still missing, I'll run another comparison to see what those look like.

Update 3: Bad news. There were 13 extra posts in the HTML data instead of 7, meaning at least 6 extra posts were present in the API submission and comment data but not the HTML-crawled data. So unfortunately the HTML-crawled data is also incomplete. This means the API actually missed at least 36, giving me a total of at least 534 posts. But there could be a few more that are missing from all 3 sets of data.

Update 4: I noticed that precisely 6 of the 13 extra posts in the HTML data were in +general. I doubt that means anything though, if you remove +general you are still left with 7 extra posts in the HTML data with only 2 expected.
The names of the guilds that posts were kicked from might be versioned in my records, but that info is not much help here. If you remember the titles, you may be able to search for them on SVA. I hope you are able to recover the ones that are important to you.
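
The overlapping fetch mentioned in the quote above can be sketched roughly like this. It is only an illustration of the idea, not the crawler that was actually used; the `id` field and the batch layout are assumptions made for the example.

```python
def merge_batches(batches):
    """Merge overlapping batches of records, keeping the first copy of each id."""
    merged = {}
    for batch in batches:
        for record in batch:
            # setdefault keeps the first record seen for an id, so records
            # repeated in the overlap between two batches are not duplicated.
            merged.setdefault(record["id"], record)
    return list(merged.values())


# Hypothetical batches whose edges overlap (ids 3 and 4 appear in both).
batch_a = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
batch_b = [{"id": 3}, {"id": 4}, {"id": 5}]
comments = merge_batches([batch_a, batch_b])
print(len(comments))
```

Fetching with deliberate overlap costs some duplicate downloads, but a record that falls into a pagination gap in one batch still has a chance of showing up in the next.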
CrystalVulpine
Posts: 8
Joined: Mon Dec 13, 2021 1:07 am
Topic points (SCP): 0
Reply points (CCP): 20

Re: Ruqqus public dataset

Post by CrystalVulpine »

MadWorld wrote: Tue Dec 14, 2021 3:32 am
CrystalVulpine wrote: Tue Dec 14, 2021 1:09 am
MadWorld wrote: Tue Dec 14, 2021 12:01 am

Yes, it was converted from HTML. It was started after @antiliberalsociety expressed his interest in preserving the data. The crawled version included the pre-purge version of the data.

After ruqqus went into "read-only" mode, there were inconsistencies and large gaps in the pagination. The comments were fetched in parts so that the parts overlapped the gaps. The submissions had fewer pagination inconsistencies. But at least one guild that I was aware of (+general) was not available through the API. We can expect the API data to be incomplete because of the changes the admins had made.
@MadWorld I have 13 posts in +general in the API version, so it must've been available. But you're right that it's missing some data, because I only have 498 posts in the API data whereas the HTML-crawled data includes 528, meaning the API excluded 30 of them. Maybe that's from posting in banned guilds, but I don't remember posting in banned guilds that often.

I have to wonder if they created the pagination gaps on purpose to destroy as much data as possible, since they also shut down way earlier than they said they would.

Update: Since each comment includes the original post within itself, most of the missing posts could be copied from there. So the only ones that have to be reconstructed from the HTML crawl are posts that were skipped by the API and have no comments (or were purged).

Update 2: I recovered 23 extra posts from the comments. 21 of them were in +general, the 2 that weren't were both in +FreeForum, and one of those 2 was originally in +HiddenWebGems but kicked to +general and yanked to +FreeForum. So it appears that +general was accessible from the API, but extremely spotty (and maybe +FreeForum too, but it's only 2 posts so it could be a coincidence). 7 posts are still missing, I'll run another comparison to see what those look like.

Update 3: Bad news. There were 13 extra posts in the HTML data instead of 7, meaning at least 6 extra posts were present in the API submission and comment data but not the HTML-crawled data. So unfortunately the HTML-crawled data is also incomplete. This means the API actually missed at least 36, giving me a total of at least 534 posts. But there could be a few more that are missing from all 3 sets of data.

Update 4: I noticed that precisely 6 of the 13 extra posts in the HTML data were in +general. I doubt that means anything though, if you remove +general you are still left with 7 extra posts in the HTML data with only 2 expected.
The names of the guilds that posts were kicked from might be versioned in my records, but that info is not much help here. If you remember the titles, you may be able to search for them on SVA. I hope you are able to recover the ones that are important to you.
@MadWorld It was a false alarm: one of my temp files was corrupted and added duplicates. Without those, there were just 7 posts found only in the HTML data, as expected.
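
For anyone repeating this kind of comparison, here is a minimal sketch of the three-way check using sets, which also makes stray duplicates harmless. It assumes each post has a unique id; the function name, the ids, and the sample data are hypothetical, and this is not necessarily how the comparison above was done.

```python
def compare_sources(api_ids, comment_ids, html_ids):
    """Compare post ids from the API listing, the comments, and the HTML crawl."""
    # Converting to sets drops duplicates such as those from a corrupted file.
    api_ids, comment_ids, html_ids = set(api_ids), set(comment_ids), set(html_ids)
    known = api_ids | comment_ids
    return {
        # Posts embedded in comments that the API post listing skipped.
        "recovered_from_comments": comment_ids - api_ids,
        # Posts that exist only in the HTML crawl.
        "html_only": html_ids - known,
        # Posts the HTML crawl does not have.
        "missing_from_html": known - html_ids,
        # Everything seen in at least one source.
        "total": known | html_ids,
    }


# Hypothetical ids; the repeated "p2" mimics a duplicated entry.
api = ["p1", "p2", "p2"]
from_comments = ["p2", "p3"]
html = ["p1", "p2", "p3", "p4"]
report = compare_sources(api, from_comments, html)
print(len(report["total"]))
```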
Last edited by CrystalVulpine on Tue Dec 14, 2021 4:06 am, edited 1 time in total.
CrystalVulpine
Posts: 8
Joined: Mon Dec 13, 2021 1:07 am
Topic points (SCP): 0
Reply points (CCP): 20

Re: Ruqqus public dataset

Post by CrystalVulpine »

Also, it looks like the admins retroactively autobanned posts with slurs in them. I figured it out because I found that innocent posts such as this one were admin-removed (and I don't remember them being removed before).
antiliberalsociety
Posts: 2633
Joined: Wed Dec 23, 2020 2:00 am
Topic points (SCP): 3394
Reply points (CCP): 4462

Re: Ruqqus public dataset

Post by antiliberalsociety »

Slurs are calls for violence, and calls for violence are against the rules. 😆

That gives me great comfort, knowing they had to manually delete mine because I didn't use slurs very often. Most of my most damaging facts were on images not hosted on Ruqqus, and I don't see them having an algorithm for that. I did enjoy removing their masks.
MadWorld
Posts: 1229
Joined: Wed Dec 23, 2020 2:00 am
Topic points (SCP): 1276
Reply points (CCP): 2987

Re: Ruqqus public dataset

Post by MadWorld »

captainmeta4 blocked me so that I can't reply to his posts (even though he still replies to mine). He's afraid of my dissenting. (ruqqus.com)

submitted 11 months ago by @CrystalVulpine to ruqqus +The_Cabal
https://ruqqus.com/+The_Cabal/post/6il ... -so-that-i

[captainmeta4](https://ruqqus.com/@captainmeta4) has now blocked me. Of course he's an admin so he can still see my posts and reply to them, and indeed he is still doing so. Therefore the block only serves one purpose: so I can't reply back and give my side of the situation! His word will now be the final and only one from now on, and I assume he has done this because he's afraid of me exposing him and his discord shenanigans.

[This happened right after he threatened to ban me from the site for dissenting a second time](https://ruqqus.com/+RuqES/post/6c5g/ruq ... ?context=3). Daddy spez would be proud. If I end up banned, we know the truth.

I've now blocked him back. Unfortunately he is an admin and can get around it, and I'm sure he will like he does with guild exiles. But it serves a symbolic purpose.

:lol: :lol: That was hilarious. I believe the admin of poal.co also has this feature.
antiliberalsociety
Posts: 2633
Joined: Wed Dec 23, 2020 2:00 am
Topic points (SCP): 3394
Reply points (CCP): 4462

Re: Ruqqus public dataset

Post by antiliberalsociety »

MadWorld wrote: Tue Dec 14, 2021 5:12 am
captainmeta4 blocked me so that I can't reply to his posts (even though he still replies to mine). He's afraid of my dissenting. (ruqqus.com)

submitted 11 months ago by @CrystalVulpine to ruqqus +The_Cabal
https://ruqqus.com/+The_Cabal/post/6il ... -so-that-i

[captainmeta4](https://ruqqus.com/@captainmeta4) has now blocked me. Of course he's an admin so he can still see my posts and reply to them, and indeed he is still doing so. Therefore the block only serves one purpose: so I can't reply back and give my side of the situation! His word will now be the final and only one from now on, and I assume he has done this because he's afraid of me exposing him and his discord shenanigans.

[This happened right after he threatened to ban me from the site for dissenting a second time](https://ruqqus.com/+RuqES/post/6c5g/ruq ... ?context=3). Daddy spez would be proud. If I end up banned, we know the truth.

I've now blocked him back. Unfortunately he is an admin and can get around it, and I'm sure he will like he does with guild exiles. But it serves a symbolic purpose.

:lol: :lol: That was hilarious. I believe the admin of poal.co also has this feature.
Don't give The_Venereal any ideas...
MadWorld
Posts: 1229
Joined: Wed Dec 23, 2020 2:00 am
Topic points (SCP): 1276
Reply points (CCP): 2987

Re: Ruqqus public dataset

Post by MadWorld »

antiliberalsociety wrote: Tue Dec 14, 2021 6:16 am Don't give The_Venereal any ideas...
:lol: Well, the vote and account suspension algorithm takes priority.
CrystalVulpine
Posts: 8
Joined: Mon Dec 13, 2021 1:07 am
Topic points (SCP): 0
Reply points (CCP): 20

Re: Ruqqus public dataset

Post by CrystalVulpine »

MadWorld wrote: Mon Nov 08, 2021 9:29 pm You could create a template out of ruqqus's static page and plug in the info available. :lol: It would be hilarious to see a near-identical page view on SearchVoat page.

You could even use "searchvoat.co/ruqqus/[original url without domain name]" to view SearchVoat's version of data.
@MadWorld It's a work-in-progress, but I'm actually using the jinja2 templates and plugging in the JSON data with nunjucks:

Image
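
The URL scheme from the quote above ("searchvoat.co/ruqqus/[original url without domain name]") can be sketched as a small helper. This is a guess at the mapping based only on that description; the `https` scheme, the function name, and the example URL are assumptions.

```python
from urllib.parse import urlsplit


def to_searchvoat(ruqqus_url, base="https://searchvoat.co/ruqqus"):
    """Drop the original domain and append the rest of the URL to the base."""
    parts = urlsplit(ruqqus_url)
    path = parts.path
    if parts.query:
        # Keep any query string, e.g. a context parameter on a comment link.
        path += "?" + parts.query
    return base + path


# Hypothetical ruqqus URL, not a real post.
example = to_searchvoat("https://ruqqus.com/+example/post/abc/some-title")
print(example)
```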
antiliberalsociety
Posts: 2633
Joined: Wed Dec 23, 2020 2:00 am
Topic points (SCP): 3394
Reply points (CCP): 4462

Re: Ruqqus public dataset

Post by antiliberalsociety »

CrystalVulpine wrote: Tue Dec 14, 2021 10:55 pm
MadWorld wrote: Mon Nov 08, 2021 9:29 pm You could create a template out of ruqqus's static page and plug in the info available. :lol: It would be hilarious to see a near-identical page view on SearchVoat page.

You could even use "searchvoat.co/ruqqus/[original url without domain name]" to view SearchVoat's version of data.
@MadWorld It's a work-in-progress, but I'm actually using the jinja2 templates and plugging in the JSON data with nunjucks:

Image
For the record, I tried it and got lost. They're right about the memory management. Perhaps sfrohne could break it down in a more noob-friendly way.

I would love to see the comments applied to posts as @MadWorld said.
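
As a rough idea of what applying the comments to posts could look like once both are loaded from JSON, here is a minimal sketch. The `id` and `post_id` field names are hypothetical and may not match the dataset's actual keys.

```python
from collections import defaultdict


def attach_comments(posts, comments):
    """Return copies of the posts, each with its comments under a "comments" key."""
    by_post = defaultdict(list)
    for comment in comments:
        # Group comments by the post they belong to.
        by_post[comment["post_id"]].append(comment)
    # Posts without comments get an empty list.
    return [{**post, "comments": by_post.get(post["id"], [])} for post in posts]


# Hypothetical records.
posts = [{"id": "a", "title": "First"}, {"id": "b", "title": "Second"}]
comments = [{"id": "c1", "post_id": "a"}, {"id": "c2", "post_id": "a"}]
threads = attach_comments(posts, comments)
print(len(threads[0]["comments"]))
```

The resulting list of dicts could then be passed to whichever template renders the post page.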