Hi,
I use CL to scrape several comic websites and generate a website that collects the daily strips from that. The (small) program's features:
* Easy definition of comic sources
* Uses XPath to get the comics
* Stores an archive of daily comics
* Generates web pages with comics per day
A feature that is IMHO still missing is easier (maybe interactive) comic specification. Currently, you had better get your XPath right.
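To give an idea of what "easy definition" means: a comic source boils down to a page URL plus an XPath expression for the strip image. The snippet below is only a rough sketch of that idea, assuming drakma, closure-html and Plexippus XPath; the names (define-comic, *comics*) and the example XPath are made up here and are not necessarily what comics.lisp uses:

;; Rough sketch: a comic source is a page URL plus an XPath expression
;; selecting the strip's image URL. Assumes drakma, closure-html (chtml)
;; and Plexippus XPath, all available via Quicklisp.
(defvar *comics* (make-hash-table :test #'equal)
  "Registered comic sources, keyed by name.")

(defmacro define-comic (name &key url xpath)
  "Register a comic source under NAME."
  `(setf (gethash ,name *comics*) (list :url ,url :xpath ,xpath)))

(defun fetch-comic-image-url (name)
  "Download the comic's page and extract the strip image URL via XPath."
  (destructuring-bind (&key url xpath) (gethash name *comics*)
    (let ((document (chtml:parse (drakma:http-request url)
                                 (cxml-dom:make-dom-builder))))
      (xpath:string-value (xpath:evaluate xpath document)))))

;; Example registration (URL and XPath chosen for illustration only):
(define-comic "xkcd"
  :url "https://xkcd.com/"
  :xpath "//div[@id='comic']//img/@src")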
I think it would be possible to promote CL with an application like that.
Could you have a look at the code and give me some hints on style and CL in general, so that the code actually becomes good enough for that purpose?
You can find the code here: https://github.com/Neronus/Lisp-Utils/blob/master/comics.lisp and an example of the generated output here: http://christian.ftwca.de/comics/
Thank you,
Christian
On Tue, Oct 18, 2011 at 19:15, Christian von Essen christian@mvonessen.de wrote:
Dear Christian,
I'm interested in your web scraping technology in CL.
I'd like to build a distributed web proxy that persistently records everything one views, so that you can always read and share the pages you like, even when the author dies, the servers are taken offline, the domain name is bought by someone else, and the new owner puts up a robots.txt that tells archive.org not to display the pages anymore.
I don't know if this adventure tempts you, but I think the time is ripe for end-user-controlled, peer-to-peer, distributed archival and sharing of information. An obvious application, beyond archival, is a distributed Facebook/G+ replacement.
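To make the "persistently records" part concrete, the core step is just fetch, hash, file away. A rough sketch, assuming drakma, ironclad and babel; the on-disk layout (*archive-root*, <url-hash>/<timestamp>.html) is entirely made up:

;; Sketch of the recording step only; *ARCHIVE-ROOT* and the
;; <url-hash>/<timestamp>.html layout are invented for illustration.
(defvar *archive-root* #p"/var/archive/web/")

(defun url-key (url)
  "Stable directory name for a URL: hex SHA-256 of the URL string."
  (ironclad:byte-array-to-hex-string
   (ironclad:digest-sequence :sha256 (babel:string-to-octets url))))

(defun record-page (url)
  "Fetch URL and store its body under <archive-root>/<url-hash>/<timestamp>.html."
  (let* ((body (drakma:http-request url :force-binary t))
         (path (merge-pathnames (format nil "~a/~a.html"
                                        (url-key url) (get-universal-time))
                                *archive-root*)))
    (ensure-directories-exist path)
    (with-open-file (out path :direction :output
                              :element-type '(unsigned-byte 8)
                              :if-exists :supersede)
      (write-sequence body out))
    path))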
PS: in shelisp, maybe you could use xcvb-driver:run-program/process-output-stream instead of yet another partial run-program interface. I really would like to see half-baked portability layers die. If you really need input as well as output, I could hack that into the xcvb-driver utility.
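For what it's worth, the calling convention is roughly a command plus a function that consumes the process's output stream; this is from memory, so check the xcvb-driver source for the exact lambda list:

;; Collect the output lines of a shell command (signature from memory;
;; keyword arguments such as :ignore-error-status may differ).
(xcvb-driver:run-program/process-output-stream
 '("ls" "-l")
 (lambda (stream)
   (loop for line = (read-line stream nil)
         while line collect line)))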
—♯ƒ • François-René ÐVB Rideau •Reflection&Cybernethics• http://fare.tunes.org
On Fri, 28 Oct 2011 11:29:38 -0400, Faré fahree@gmail.com wrote:
I cannot add anything, except to express emphatic agreement.
One important thing, IMO, would be mathematically sound, peer-to-peer co-verification of archive authenticity -- perhaps in the same sense that git manages to do it.
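To illustrate what I mean by "in the same sense as git": objects are named by the hash of their content, and each snapshot's id covers its parent's id, so peers that agree on one id agree on the whole history. A toy sketch with ironclad and babel (the function names are invented here):

;; Toy content addressing in the style of git, using SHA-256 via ironclad.
(defun sha256-hex (octets)
  (ironclad:byte-array-to-hex-string
   (ironclad:digest-sequence :sha256 octets)))

(defun object-id (content-octets)
  "An archived page is named by the hash of its bytes."
  (sha256-hex content-octets))

(defun commit-id (parent-id object-ids)
  "A snapshot's id covers its parent's id and the ids of the objects it
contains, so agreeing on one commit id means agreeing on the whole history."
  (sha256-hex
   (babel:string-to-octets
    (format nil "parent:~a objects:~{~a~^,~}" parent-id object-ids))))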
On Tue, Nov 1, 2011 at 12:20 PM, Samium Gromoff skosyrev@common-lisp.netwrote:
I agree. It's becoming pretty obvious to me that the 'web' can be described as being in a state of constant rot and regrowth (sites go down, other sites go up). Unfortunately, the rot takes with it some really valuable pieces of information.
An interesting definition of a website might be as a git repository: hyperlinks reference both a file and the changeset hash at which the file was valid; a 'certified' website might have a GPG signature on the commits as well.
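For instance, such a link might carry the repository, the commit the page was valid at, and the path within it; something like this (the format is entirely hypothetical):

http://example.org/site.git/<commit-sha1>/docs/page.html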
One interesting application might be an 'archiving browser', which caches all or most of the sites you visit. Instead of rummaging through Google trying to remember which search terms led you to that one site (if it's still indexed by Google, and if it's still up), you can run a query against your local archive.
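As a sketch of the "query your local archive" part, assuming pages are cached as files under some directory and using cl-ppcre for matching (the names and layout are invented, and a real version would keep an index rather than scanning every file):

;; Grep-style query over a locally cached copy of pages you have visited.
;; *CACHE-ROOT* and the file layout are invented for this sketch; note the
;; "**" recursive wildcard in DIRECTORY is implementation-dependent.
(defvar *cache-root* #p"/var/cache/browser-archive/")

(defun search-cache (regex)
  "Return the cached HTML files whose contents match REGEX."
  (let ((scanner (cl-ppcre:create-scanner regex :case-insensitive-mode t)))
    (remove-if-not
     (lambda (file)
       (with-open-file (in file)
         (loop for line = (read-line in nil)
               while line
               thereis (cl-ppcre:scan scanner line))))
     (directory (merge-pathnames "**/*.html" *cache-root*)))))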
As a personal project, I have been contemplating putting together a web spider/index for better web searching; it would be nice to contribute components from that to a larger project relating to web storage & archiving.
Regards, Paul
On Tue, 1 Nov 2011 18:32:01 -0700 Paul Nathan pnathan.software@gmail.com wrote:
I really like this idea. There exist a few distributed spider-plus-search-engine projects which, with enough participants, could perhaps one day replace commercial search engines while permitting unrestricted searches (ever noticed how the public Google search interface used to be more powerful, but has since been "censored"?). Unfortunately, those projects are still unpopular and cannot compete at the moment.
A distributed archiving system could also embed such a distributed search engine...