On Tue, Nov 1, 2011 at 12:20 PM, Samium Gromoff <skosyrev@common-lisp.net> wrote:
On Fri, 28 Oct 2011 11:29:38 -0400, Faré <fahree@gmail.com> wrote:
> Dear Christian,
>
> I'm interested in your web scraping technology in CL.
>
> I'd like to build a distributed web proxy that persistently records
> everything one views, so that you can always read and share the pages
> you like even when the author dies, the servers are taken off-line,
> the domain name is bought by someone else, and the new owner puts a
> new robots.txt that tells archive.org to not display the pages
> anymore.
>
> I don't know if this adventure tempts you, but I think the time is
> ripe for end-user-controlled peer-to-peer distributed archival and
> sharing of information. Obvious application, beyond archival, is a
> distributed facebook/g+ replacement.

I cannot add anything, but I want to express emphatic agreement.

One important thing, IMO, would be a mathematically-sound, peer-to-peer
archive authenticity co-verification -- perhaps in the same sense as
git manages to do it.


I agree.  It's becoming pretty obvious to me that the 'web' is in a state of constant rot and regrowth (sites go down; other sites go up). Unfortunately, the rot takes some really valuable pieces of information with it.

An interesting definition of a website might be a git repository: each hyperlink names both a file and the changeset hash at which that file was valid; a 'certified' website might additionally have GPG signatures on its commits.
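
To make the hash-pinned hyperlink idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: `PinnedLink` and `make_link` are hypothetical names, and for simplicity the link pins a single file by its git blob hash rather than by a full changeset hash, which would require walking the commit and tree objects as well. The blob hash is computed the way git does it (SHA-1 over a `blob <size>\0` header followed by the content).

```python
import hashlib
from dataclasses import dataclass


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object id git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


@dataclass(frozen=True)
class PinnedLink:
    """Hypothetical hyperlink that names a file and the hash it was valid at."""
    path: str
    blob_hash: str

    def verify(self, content: bytes) -> bool:
        # The link is satisfied only if the fetched bytes hash to the pinned id.
        return git_blob_hash(content) == self.blob_hash


def make_link(path: str, content: bytes) -> PinnedLink:
    """Create a link pinned to the current content of `path`."""
    return PinnedLink(path, git_blob_hash(content))
```

A reader who fetches the page from any peer can call `link.verify(fetched_bytes)`; a mismatch means the copy is not the version the author linked to. Signing (the 'certified' case) would sit on top of this and is not shown.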

One interesting application might be an 'archiving browser', which caches all or most of the sites you visit. Instead of rummaging through Google trying to remember which search terms led to that one site (assuming it is still indexed and still up), you could run a query against your local archive.
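
As a rough sketch of what querying a local archive could look like, here is a tiny in-memory inverted index in Python. Everything in it is an assumption made for illustration: `LocalArchive`, `add`, and `search` are hypothetical names, pages are reduced to plain text, and a real archiving browser would need persistent storage, ranking, and HTML handling.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LocalArchive:
    """Hypothetical local cache of visited pages with a simple term index."""

    def __init__(self) -> None:
        self.pages: dict[str, str] = {}                  # url -> cached text
        self.index: dict[str, set[str]] = defaultdict(set)  # term -> urls

    def add(self, url: str, text: str) -> None:
        # Drop index entries for any previously cached version of this URL.
        for term in tokenize(self.pages.get(url, "")):
            self.index[term].discard(url)
        self.pages[url] = text
        for term in tokenize(text):
            self.index[term].add(url)

    def search(self, query: str) -> list[str]:
        """Return the URLs whose cached text contains every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(hits)
```

For example, after `archive.add(url, page_text)` for each page visited, `archive.search("common lisp")` returns the cached URLs that mention both terms, whether or not the original site is still online.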

As a personal project, I have been contemplating putting together a web spider/index for better web searching; it would be nice to contribute components from that to a larger project relating to web storage & archiving.

Regards,
Paul