MadWorld wrote: ↑Tue Dec 14, 2021 12:01 am
CrystalVulpine wrote: ↑Mon Dec 13, 2021 11:12 pm
@MadWorld Thank you so much! I'll definitely use this.
So you got the initially uploaded version by crawling the HTML, I assume? You said the API didn't show everything, which is strange.
Yes, it was converted from HTML. The crawl was started after @antiliberalsociety expressed his interest in preserving the data. The crawled version includes the pre-purge version of the data.
After Ruqqus went into "read-only" mode, there were inconsistencies and large gaps in the pagination. The comments were fetched in overlapping parts to cover the gaps. The submissions had fewer pagination inconsistencies. But at least one guild that I was aware of (+general) was not available through the API. We can expect the API data to be incomplete because of the changes the admins had made.
@MadWorld I have 13 posts in +general in the API version, so it must've been available. But you're right that it's missing some data, because I only have 498 posts in the API data whereas the HTML-crawled data includes 528, meaning the API excluded 30 of them. Maybe that's from posting in banned guilds, but I don't remember posting in banned guilds that often.
I have to wonder if they created the pagination gaps on purpose to destroy as much data as possible, since they also shut down way earlier than they said they would.
Update: Since each comment includes the original post within itself, most of the missing posts could be copied from there. So the only ones that have to be reconstructed from the HTML crawl are posts that were skipped by the API and have no comments (or were purged).
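For anyone wanting to do the same, here is a minimal sketch of pulling missing posts out of the comment data. It assumes each comment has been loaded as a dict with the parent post embedded under a `post` key, and that posts carry `id` and `author_name` fields. Those field names and the function name are my own placeholders, not necessarily what the archive uses, so adjust them to match the actual files.

```python
def recover_posts_from_comments(api_posts, api_comments, author):
    """Return posts by `author` that are embedded in comments but absent from api_posts.

    Field names ("post", "id", "author_name") are assumed for illustration.
    """
    known_ids = {post["id"] for post in api_posts}
    recovered = {}
    for comment in api_comments:
        post = comment.get("post")
        if not post:
            continue  # comment carries no embedded post
        if post.get("author_name") != author:
            continue  # someone else's post
        if post["id"] in known_ids:
            continue  # already present in the API submission data
        # Several comments can embed the same post; keep the first copy only.
        recovered.setdefault(post["id"], post)
    return list(recovered.values())


if __name__ == "__main__":
    posts = [{"id": "a1", "author_name": "me"}]
    comments = [
        {"id": "c1", "post": {"id": "a1", "author_name": "me"}},
        {"id": "c2", "post": {"id": "b2", "author_name": "me"}},
        {"id": "c3", "post": {"id": "b2", "author_name": "me"}},
        {"id": "c4", "post": {"id": "z9", "author_name": "someone_else"}},
    ]
    for post in recover_posts_from_comments(posts, comments, "me"):
        print(post["id"])
```

Posts that nobody commented on will never show up this way, which is why those still have to come from the HTML crawl.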
Update 2: I recovered 23 extra posts from the comments. 21 of them were in +general; the other 2 were both in +FreeForum, and one of those 2 was originally in +HiddenWebGems but was kicked to +general and then yanked to +FreeForum. So it appears that +general was accessible from the API, but extremely spotty (and maybe +FreeForum too, but it's only 2 posts so it could be a coincidence). 7 posts are still missing. I'll run another comparison to see what those look like.
Update 3: Bad news. There were 13 extra posts in the HTML data instead of 7, meaning at least 6 extra posts were present in the API submission and comment data but not the HTML-crawled data. So unfortunately the HTML-crawled data is also incomplete. This means the API actually missed at least 36, giving me a total of at least 534 posts. But there could be a few more that are missing from all 3 sets of data.
Update 4: I noticed that precisely 6 of the 13 extra posts in the HTML data were in +general. I doubt that means anything though.
Update 5: This was a false alarm. If you take +general out of the equation, you get 503 (haha!) posts by me from the HTML crawl, and 485 from the API, or 487 after including the posts embedded in comments. After adding the 7 extra posts from the HTML crawl, there are only 494, which doesn't match the 503.
This led me to recheck my data, and it turned out I was using an older file containing my posts from the HTML data that also had any posts I had commented on added to it. To avoid grabbing other people's posts, I put a condition in my script to only compare posts if I was the author; however, after the first 528 lines, where other people's posts began, the file also included several duplicates of some of my own posts. I re-exported them, and using only the original 528, there were only 7 extra posts in the HTML data that weren't in the API data, as expected. So no, as of now the HTML crawl does not appear to have missed anything publicly accessible. 4 of the posts only available in the HTML data were in +general.
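In case it helps anyone avoid the same mistake, here is a rough sketch of the comparison with the duplicates removed first. It assumes both data sets are lists of dicts with `id` and `author_name` fields; those names and both function names are placeholders I made up for the example, so they may not match the real export.

```python
def dedupe_by_id(posts):
    """Keep the first occurrence of each post ID, preserving order."""
    unique = {}
    for post in posts:
        unique.setdefault(post["id"], post)
    return list(unique.values())


def html_only_posts(html_posts, api_posts, author):
    """Return the author's posts found in the HTML crawl but not in the API data."""
    # Deduplicate before filtering so repeated copies are not counted twice.
    mine = [p for p in dedupe_by_id(html_posts) if p.get("author_name") == author]
    api_ids = {p["id"] for p in api_posts}
    return [p for p in mine if p["id"] not in api_ids]


if __name__ == "__main__":
    html = [
        {"id": "a1", "author_name": "me"},
        {"id": "b2", "author_name": "me"},
        {"id": "z9", "author_name": "someone_else"},
        {"id": "b2", "author_name": "me"},  # duplicate of an earlier line
    ]
    api = [{"id": "a1", "author_name": "me"}]
    print(len(html_only_posts(html, api, "me")))
```

The API list should already include whatever was recovered from the comments before running this, otherwise those posts get counted as HTML-only.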