lru0612/BlogCollector
🚀 BlogCollector

BlogCollector is a simple AI/tech blog aggregator that supports both RSS feeds and web scraping, well suited to personal knowledge tracking and information monitoring.


📑 Table of Contents

  • Features
  • Quick Start
  • Project Structure
  • Deploy to the Web (Render + GitHub Pages)
  • Add / Modify Data Sources
  • FAQ
  • Contributing & License

Features

  • Multiple sources: RSS / Atom feeds & any webpage via custom CSS selectors
  • Categories & filtering: Organization / Individual tags, one-click filtering plus search

Quick Start

1. Clone the repo

$ git clone https://github.com/<yourname>/BlogCollector.git
$ cd BlogCollector

2. Start the backend

$ cd backend
$ npm install
$ npm start     # default PORT=3000, can be overridden
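
How the overridable port presumably works inside `backend/server.js` can be sketched like this (the helper name `resolvePort` is illustrative, not taken from the repo; check the actual file):

```javascript
// Resolve the listen port: the PORT env var wins, otherwise fall back
// to 3000 (sketch; the real server.js may read it differently).
function resolvePort(env) {
  const n = Number(env.PORT);
  return Number.isInteger(n) && n > 0 ? n : 3000;
}

console.log(resolvePort(process.env)); // 3000 unless PORT is set in your shell
```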

3. Preview the frontend locally

$ cd docs                   # static assets now live in docs/
$ npx serve -l 8080 .       # or: python3 -m http.server 8080

Then visit http://localhost:8080.

💡 VS Code users can also install Live Server and choose Open with Live Server.
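
When previewing locally against a locally running backend, point the frontend at it. A sketch of the relevant line in docs/script.js (assumption: the constant name matches the one used in the deploy instructions later in this README; adjust if your copy differs):

```javascript
// docs/script.js — point the frontend at the local backend started in
// step 2 (sketch; revert to your deployed API URL before publishing).
const API_BASE_URL = 'http://localhost:3000/api';
```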


Project Structure

BlogCollector/
├─ backend/            # Node.js / Express backend
│  ├─ server.js
│  └─ ...
├─ docs/               # static frontend (published via GitHub Pages)
│  ├─ index.html
│  ├─ script.js
│  └─ style.css
└─ README.md

Deploy to the Web (Render + GitHub Pages)

| Part | Platform | Steps |
| --- | --- | --- |
| Backend | Render | Connect the repo → New Web Service → root dir `backend` → Build `npm install` / Start `npm start` → get `https://<app>.onrender.com` |
| Frontend | GitHub Pages | Settings → Pages → Source `main` / Folder `/docs` → Save → access `https://<user>.github.io/<repo>/` |

Update docs/script.js:

const API_BASE_URL = 'https://<app>.onrender.com/api';

Now anyone can open the GitHub Pages URL and the site will call your Render API.
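
Under the hood, the frontend presumably builds request URLs from `API_BASE_URL`. A hedged sketch of that step (the endpoint name `articles` and the helper `apiUrl` are illustrative guesses, not copied from the repo — check docs/script.js for the real paths):

```javascript
const API_BASE_URL = 'https://<app>.onrender.com/api';

// Join the base URL and an endpoint path without doubling slashes.
function apiUrl(base, path) {
  return `${base.replace(/\/+$/, '')}/${path.replace(/^\/+/, '')}`;
}

// Sketch of a typical fetch against the deployed backend.
async function fetchArticles() {
  const res = await fetch(apiUrl(API_BASE_URL, 'articles'));
  if (!res.ok) throw new Error(`API error: ${res.status}`);
  return res.json(); // expected: an array of article objects
}
```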


Add / Modify Data Sources

1. RSS sources

Append entries to the rssSources array in backend/server.js:

const rssSources = [
  { name: 'OpenAI', url: 'https://openai.com/blog/rss.xml', category: 'organization' },
  // new source
  { name: 'Example Blog', url: 'https://example.com/rss.xml', category: 'individual' },
];
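
Each entry needs a `name`, a feed `url`, and a `category`. A small sanity-check you could run before restarting the backend (sketch; `validateSources` is a hypothetical helper, not part of BlogCollector):

```javascript
// Validate rssSources entries: returns a list of problems, empty if
// everything looks sane (sketch for illustration only).
function validateSources(sources) {
  const problems = [];
  for (const s of sources) {
    if (!s.name) problems.push(`missing name: ${JSON.stringify(s)}`);
    try {
      const u = new URL(s.url);
      if (!/^https?:$/.test(u.protocol)) problems.push(`${s.name}: non-HTTP url`);
    } catch {
      problems.push(`${s.name}: invalid url "${s.url}"`);
    }
    if (!['organization', 'individual'].includes(s.category)) {
      problems.push(`${s.name}: unknown category "${s.category}"`);
    }
  }
  return problems;
}
```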

2. Scraping targets

Append entries to the scrapingTargets array in server.js:

const scrapingTargets = [
  {
    name: 'Lilian Weng',
    url: 'https://lilianweng.github.io/',
    category: 'individual',
    selectors: {
      articleContainer: 'article.post-entry',
      title: '.entry-header h2',
      link: 'a.entry-link',
      description: 'section.entry-content p',
      time: 'footer.entry-footer',
    },
  },
  // new source example
  {
    name: 'Karpathy',
    url: 'https://karpathy.bearblog.dev/blog/',
    category: 'individual',
    selectors: {
      articleContainer: 'ul.blog-posts li',
      title: 'a',
      link: 'a',
      description: '',      // this site has no summary
      time: 'time',
    },
  },
];
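
The `selectors` map tells the scraper where to find each field inside `articleContainer`. Scraped links are often relative (e.g. `/posts/foo/`), so the backend presumably resolves them against the target's `url`. A sketch of that normalization step (the raw field values would come from applying the CSS selectors, e.g. via cheerio; `toArticle` itself is a hypothetical helper for illustration):

```javascript
// Sketch: normalize one scraped record into an article object.
function toArticle(target, raw) {
  return {
    source: target.name,
    category: target.category,
    title: (raw.title || '').trim(),
    // Resolve relative hrefs against the target URL.
    link: new URL(raw.link, target.url).href,
    description: (raw.description || '').trim(), // may be '' (no summary)
    time: (raw.time || '').trim(),
  };
}
```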

After editing sources, restart the backend (stop the running `npm start` process with Ctrl-C first; `npm restart` alone won't free port 3000 unless a stop script is defined):

$ cd backend
$ npm start

FAQ

| Issue | Solution |
| --- | --- |
| Port already in use | Change the `PORT` env var (e.g. `PORT=4000 npm start`) or free port 3000 |
| CORS error | CORS is enabled globally; update the whitelist if you serve the frontend from a CDN |
| Scrape fails | Check for anti-bot measures and verify your CSS selectors |

Contributing & License

  • Pull requests, issues and stars are welcome! 🌟
  • Released under the MIT License — free for personal & commercial use.
