ModernDude/pynch

Pynch

This library needs to be upgraded to the latest version of Clojure.

Pynch is a library for parsing submissions from news.arc sites such as Hacker News and Arc Forum.

Quick Start - Emacs/Lein/Swank

  1. Create a leiningen project

    lein new testpynch

  2. Update testpynch/project.clj

    (defproject testpynch "1.0.0-SNAPSHOT"
      :description "Test Pynch"
      :dependencies [[org.clojure/clojure "1.2.1"]
                     [pynch "0.1.0-alpha"]]
      :dev-dependencies [[swank-clojure "1.2.1"]])

  3. Update testpynch/src/testpynch/core.clj

    ;; The namespace must match the file path src/testpynch/core.clj.
    (ns testpynch.core
      (:require [pynch.core :as py]))

    ;; Fetch the Hacker News front page and return its first submission.
    (defn get-first-hn-sub []
      (-> "http://news.ycombinator.com" java.net.URI. py/get-subs first))

  4. Use lein to grab dependencies

    lein deps

  5. Start swank server

    lein swank

  6. Open core.clj in emacs and connect with slime

    M-x slime-connect

  7. Build

    C-c k

  8. At REPL

    user> (ns testpynch.core)

    testpynch.core> (get-first-hn-sub)

API

get-subs

Usage:

(get-subs res)
(get-subs res fields)

Returns a sequence of maps, one for each submission located at or within res. res can be any of the following types: String, java.io.FileInputStream, java.io.Reader, java.io.InputStream, java.net.URL, java.net.URI. The optional fields parameter is a coll of fields to be selected, extracted and returned from the function call. Each field must implement the FieldSpecifier protocol. If fields is not supplied, the default list of fields specified by *default-sub-fields* will be used.

    (get-subs (java.net.URI. "http://news.ycombinator.com"))
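
Since each submission comes back as a plain map, ordinary sequence functions are enough to work with the results. The sketch below uses hand-written sample maps in place of a live call. Only the keys are taken from the field table in this README; the values and the top-subs helper are made up for illustration.

```clojure
;; Hypothetical sample data standing in for the result of (py/get-subs ...).
;; The keys (:points, :user, :sub-url) come from this README; the values
;; are invented for illustration.
(def sample-subs
  [{:points 12 :user "alice" :sub-url "http://example.com/a"}
   {:points 87 :user "bob"   :sub-url "http://example.com/b"}
   {:points 40 :user "carol" :sub-url "http://example.com/c"}])

(defn top-subs
  "Returns the n highest-scoring submissions, best first."
  [n subs]
  (take n (reverse (sort-by :points subs))))

(map :user (top-subs 2 sample-subs))
;; => ("bob" "carol")
```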

get-subs-crawl

Usage:

(get-subs-crawl res)
(get-subs-crawl res fields)

Returns a lazy sequence of maps for each submission located at or within res, followed by the submissions on the next page, and so on. The function keeps returning submissions as long as it can find a 'more' page to grab, and sleeps for *crawl-delay* seconds between requests. Just as with get-subs, an optional fields collection can be specified to define what data is returned.

Warning

Calling this function will most likely violate the robots.txt file on the target web server, so I recommend getting permission from the site's proprietor before doing anything serious with it.
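
To make the lazy, page-by-page behavior concrete, here is a minimal sketch of the general pattern. It is not pynch's implementation: crawl, fetch-page and fake-site are hypothetical names, the delay is given in milliseconds rather than seconds, and a small in-memory map stands in for real HTTP requests.

```clojure
;; Illustrative sketch only; this is NOT pynch's implementation.
;; `fetch-page` is a hypothetical function that takes a page id and returns
;; {:subs [...] :more next-page-id-or-nil}.
(defn crawl
  "Lazily returns the submissions on `page`, then those on each following
   page, sleeping delay-ms milliseconds before every follow-up request."
  [fetch-page delay-ms page]
  (lazy-seq
    (let [{:keys [subs more]} (fetch-page page)]
      (if more
        (concat subs
                (lazy-seq
                  (Thread/sleep delay-ms)
                  (crawl fetch-page delay-ms more)))
        subs))))

;; A fake three-page site used in place of real HTTP requests.
(def fake-site
  {1 {:subs [{:ordinal 1} {:ordinal 2}] :more 2}
   2 {:subs [{:ordinal 3} {:ordinal 4}] :more 3}
   3 {:subs [{:ordinal 5}]              :more nil}})

;; Taking three items only forces pages 1 and 2; page 3 is never fetched.
(map :ordinal (take 3 (crawl fake-site 10 1)))
;; => (1 2 3)
```

Bounding a crawl with take like this is also a simple way to keep the number of requests small.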

get-sub-details

Usage:

(get-sub-details res)
(get-sub-details res fields)

Returns details, including comments, for the submission located at or within res. res can be any of the following types: String, java.io.FileInputStream, java.io.Reader, java.io.InputStream, java.net.URL, java.net.URI. The optional fields parameter is a coll of fields to be selected, extracted and returned from the function call. Each field must implement the FieldSpecifier protocol. If fields is not supplied, the default list of fields specified by *default-detail-fields* will be used.
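
As a rough illustration of working with the details map, the sketch below counts comments per user. The sample map is hand-written: its keys come from the field table in this README, its values are invented, and it assumes :comments is a flat list of comment maps, which this README does not spell out. comments-per-user is a hypothetical helper.

```clojure
;; Hypothetical sample shaped like a get-sub-details result. The keys come
;; from the field table in this README; the values are invented.
(def sample-details
  {:title "An example story"
   :points 42
   :comments [{:user "alice" :cmnt-text ["First paragraph." "Second one."]}
              {:user "bob"   :cmnt-text ["A reply."]}
              {:user "alice" :cmnt-text ["Another thought."]}]})

(defn comments-per-user
  "Returns a map of user name to number of comments."
  [details]
  (frequencies (map :user (:comments details))))

(comments-per-user sample-details)
;; => {"alice" 2, "bob" 1}
```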

Available Fields

The following table describes the fields that come baked into pynch. The fields bound to the symbol sub-fields should be used with the get-subs* functions. All other fields should be used with the get-sub-details function.

Binding         Key          Type                    Default
--------------  -----------  ----------------------  -------
sub-fields      :ordinal     int                     false
sub-fields      :points      int                     true
sub-fields      :sub-time    Date                    true
sub-fields      :sub-url     string                  true
sub-fields      :user        string                  true
sub-fields      :com-url     string                  true
sub-fields      :com-count   int                     true
detail-fields   :title       string                  true
detail-fields   :time        Date                    true
detail-fields   :points      int                     true
detail-fields   :notes       string                  true
detail-fields   :com-url     string                  true
detail-fields   :com-count   int                     true
detail-fields   :comments    List of comment-fields  true
comment-fields  :user        string                  true
comment-fields  :time        Date                    true
comment-fields  :cmnt-url    string                  true
comment-fields  :cmnt-text   List of strings         true
comment-fields  :cmnt-nodes  List of html nodes      false

If no fields are specified in a function call, the default fields will be returned as defined above.

There are currently two ways to change the fields that are returned from a function.

  1. Rebind the default fields before calling a function.

    You can do this by rebinding any of the following symbols:

    • *default-sub-fields*
    • *default-detail-fields*
    • *default-comment-fields*

    Each symbol is bound to the appropriate list of keys from the table above. For example, if I only wanted the points from submissions returned, I could do the following:

    (binding [py/*default-sub-fields* [:points]]
         (py/get-subs (java.net.URI. "http://news.ycombinator.com")))
    

    If you want to change the comment fields returned by the get-sub-details call, this is the easiest method, because the function only accepts detail fields and not comment fields.

  2. Pass in a list of fields that you want to select

    (py/get-subs (java.net.URI. "http://news.ycombinator.com")
             (py/get-field-specs [:points] py/sub-fields))
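
One general Clojure caveat applies to the rebinding approach when lazy sequences are involved, such as the one returned by get-subs-crawl: a binding is only in effect while its body runs, so any lazy work that reads the var has to be realized inside the binding form. Whether pynch reads its defaults eagerly or lazily is not stated in this README, so treat the sketch below as plain language behavior rather than documented pynch behavior. *fields* and describe are hypothetical names.

```clojure
;; Hypothetical dynamic var and function, for illustration only.
(def ^{:dynamic true} *fields* [:points :user])

(defn describe
  "Lazily pairs each item with the value of *fields* at realization time."
  [items]
  (map (fn [item] [item *fields*]) items))

;; Realized inside the binding: the rebound value is seen.
(def inside
  (binding [*fields* [:points]]
    (doall (describe [1 2]))))

;; Realized after the binding has exited: the root value is seen.
(def outside
  (binding [*fields* [:points]]
    (describe [1 2])))

inside   ;; => ([1 [:points]] [2 [:points]])
outside  ;; => ([1 [:points :user]] [2 [:points :user]])
```

Wrapping the call in doall inside the binding form, or passing the fields explicitly as in the second method, avoids the surprise.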

Extending

You can add your own field definitions to be returned if you desire. You can even use this library to parse sequences on sites that are not news.arc sites, although I'm not sure anyone would find much use in that (see the select-fields function). Regardless, if you want to add a field or two, you simply need to include an object in the fields collection that implements the following protocol:

(defprotocol FieldSpecifier
  "Describes how a field can identify, select and extract
   itself from an html document."
  (get-selector [_])
  (extract-field [_ node])
  (get-key [_]))

  • get-selector must return a CSS selector as specified in the Enlive project https://github.com/cgrand/enlive
  • extract-field must extract the desired value from the selected DOM node
  • get-key returns the key under which the field will appear in the returned map
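
The three functions above could be implemented along the following lines. This is a sketch under assumptions: the protocol is redeclared locally so the example runs without pynch (in a real project you would implement the library's own protocol, whose namespace is not stated in this README), LinkField is a hypothetical field, and its selector is a guess at the page markup rather than one taken from pynch.

```clojure
;; Local copy of the protocol shown above so this sketch runs without pynch.
(defprotocol FieldSpecifier
  (get-selector [_])
  (extract-field [_ node])
  (get-key [_]))

;; Hypothetical field that pulls the link target out of an anchor node.
;; The selector is an assumption about the markup, not taken from pynch.
(defrecord LinkField []
  FieldSpecifier
  (get-selector [_] [:td.title :a])
  (extract-field [_ node] (get-in node [:attrs :href]))
  (get-key [_] :link))

;; Enlive represents elements as maps with :tag, :attrs and :content.
(def sample-node
  {:tag :a
   :attrs {:href "http://example.com/story"}
   :content ["An example story"]})

(let [field (LinkField.)]
  {(get-key field) (extract-field field sample-node)})
;; => {:link "http://example.com/story"}
```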

Known Issues

  1. If using swank-clojure, you must use v1.2.1 or earlier. See technomancy/swank-clojure#32

  2. If the source html of a submissions list is missing the user, time, or comment count for an entry, the output will not be correct. This is because each field selector is run independently and the results are then merged together, which assumes that every submission provides every piece of information. In the rare cases where some information is missing, the other fields will be mismatched.
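
The mismatch in the second issue can be reproduced with a small sketch of positional merging. This is an illustration of the failure mode only, not pynch's actual code; the data and the merge-columns helper are hypothetical.

```clojure
;; Illustration of positional merging; this is not pynch's actual code.
;; Each vector stands for the matches of one independent selector.
(def sub-urls ["http://example.com/a" "http://example.com/b" "http://example.com/c"])
(def users    ["alice" "carol"]) ; the second entry has no user in the html

(defn merge-columns
  "Pairs the nth url with the nth user, as a positional merge would."
  [sub-urls users]
  (map (fn [url user] {:sub-url url :user user}) sub-urls users))

(merge-columns sub-urls users)
;; => ({:sub-url "http://example.com/a", :user "alice"}
;;     {:sub-url "http://example.com/b", :user "carol"})
```

Here "carol" actually submitted the third entry, but she ends up attached to the second, and the third entry is dropped.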

License

Copyright (C) 2011 Jeff Sigmon

Distributed under the Eclipse Public License, the same as Clojure.
