jamesbrown

I simply want to fetch the title, image and first paragraph of an article in a …

Thu 02 May, 2019 09:45 am

I simply want to fetch the title, image and first paragraph of an article on a website.

I researched two methods: RSS feeds and web scraping.

An RSS feed gives you the title, image and description, but not every website includes an image in its feed.

Web scraping is a good idea, but how does one handle changes to the site's HTML structure?

Gurus in the house, how do you handle this?



Comments

goodmuyis

Thu 02 May, 2019 05:01 pm

You can't avoid it; it is your job to keep your code up to date. Combine the two methods, then add logic to check that all expected elements are present (title, image and first paragraph). If any of them is missing, log an error and discard the content. The error log will tell you which target site is broken.
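
A rough sketch of that check, in Python just for illustration (the same logic ports to PHP). The names merge_sources and validate_item and the dict keys are made up for this example, and it assumes every field value is a string or None:

```python
import logging

logger = logging.getLogger("scraper")

# Hypothetical field names; adjust to whatever your feed parser and scraper return.
REQUIRED_FIELDS = ("title", "image", "paragraph")


def merge_sources(feed_item, scraped_item):
    """Prefer the value from the RSS feed, fall back to the scraped value."""
    return {
        field: feed_item.get(field) or scraped_item.get(field)
        for field in REQUIRED_FIELDS
    }


def validate_item(site, item):
    """Return the item if every required field is non-empty, otherwise log and return None."""
    missing = [f for f in REQUIRED_FIELDS if not (item.get(f) or "").strip()]
    if missing:
        # The log line names the site so you can see which target is broken.
        logger.error("%s: missing %s, discarding item", site, ", ".join(missing))
        return None
    return item
```

You would call validate_item(site, merge_sources(feed_item, scraped_item)) for each article and skip anything that comes back as None.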

Or the hard way:

=> Scrape the first heading (<h1>, <h2> or <h3>) after <body>; there is roughly an 80% chance it is the page title

=> Scrape the first <img> tag

=> Scrape the first <p>
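
The three steps above could look something like this. It is only a sketch using Python's standard html.parser module, not the PHP library; the class and function names are made up, and it assumes the page has an explicit <body> tag and closes its <p> tags:

```python
from html.parser import HTMLParser


class FirstElements(HTMLParser):
    """Collect the first heading, first <img> src and first non-empty <p> after <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.title = None
        self.image = None
        self.paragraph = None
        self._capture = None   # field currently being filled: "title" or "paragraph"
        self._end_tag = None   # closing tag that ends the current capture
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if not self.in_body:
            return
        if tag in ("h1", "h2", "h3") and self.title is None and self._capture is None:
            self._capture, self._end_tag, self._buffer = "title", tag, []
        elif tag == "p" and self.paragraph is None and self._capture is None:
            self._capture, self._end_tag, self._buffer = "paragraph", tag, []
        elif tag == "img" and self.image is None:
            self.image = dict(attrs).get("src")

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._capture and tag == self._end_tag:
            # Collapse whitespace; an empty result leaves the field unset
            # so the next matching element gets a chance.
            text = " ".join("".join(self._buffer).split())
            setattr(self, self._capture, text or None)
            self._capture = None


def extract_first_elements(html):
    parser = FirstElements()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "image": parser.image, "paragraph": parser.paragraph}
```

Any field that comes back as None means the guess failed for that page, so it is worth feeding the result through a missing-field check before using it.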

You might want to check out PHP Simple HTML DOM Parser for web scraping. It also supports CSS selectors.