Semalt: 3 Steps to PHP Web Page Scraping


Web scraping, also called web data extraction or web harvesting, is the process of extracting data from a website or blog. This data is then used to set meta tags, meta descriptions, keywords and links to a site, improving its overall performance in search engine results.

Two main techniques are used to scrape data:

  • Document parsing - An XML or HTML document is converted into a DOM (Document Object Model) tree. PHP provides a capable DOM extension for this.
  • Regular expressions - Data is pulled out of web documents by matching it against regular expression patterns.
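
The two techniques can be compared on a small inline HTML string. This is a minimal sketch rather than the article's own code: the sample markup and the helper names `linksViaDom` and `linksViaRegex` are assumptions made for illustration, while `DOMDocument` and `preg_match_all` are standard PHP.

```php
<?php
// Hypothetical sample markup used only for this illustration.
$sampleHtml = '<html><body><a href="/a">First</a> <a href="/b">Second</a></body></html>';

// Technique 1: document parsing with the DOM extension.
function linksViaDom(string $html): array
{
    // Collect parser warnings internally; real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Technique 2: regular expressions.
function linksViaRegex(string $html): array
{
    // Capture the value of a double-quoted href attribute inside each <a> tag.
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1];
}

print_r(linksViaDom($sampleHtml));
print_r(linksViaRegex($sampleHtml));
```

DOM parsing tends to hold up better when the markup changes, whereas a regular expression is quick to write for one narrow, stable pattern.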

The main issue with scraping data from a third-party website is copyright, because you may not have permission to use that data. PHP makes downloading the data easy, but it does not by itself resolve copyright or data-quality concerns. As a PHP programmer, you may need data from different websites for coding purposes. This article explains how to fetch data from other sites efficiently. Keep in mind that you will end up with two files: index.php and scrape.php.

Step 1: Create a form to enter the website URL:

First, create a form in index.php with a submit button and a field for entering the URL of the website whose data you want to scrape.


Enter the website URL to scrape data
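
The form markup itself did not survive in this copy of the article; only its label text remains. A minimal sketch of what index.php could contain follows. The field names `website_url` and `submit` match the `$_POST` keys read in Step 3, while the `action` target `scrape.php`, the helper name `renderScrapeForm`, and the exact layout are assumptions.

```php
<?php
// index.php (sketch). renderScrapeForm() is a hypothetical helper name.
function renderScrapeForm(string $action = 'scrape.php'): string
{
    // Escape the action so it is safe inside an HTML attribute.
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<form method="post" action="{$safeAction}">
    <label for="website_url">Enter the website URL to scrape data</label>
    <input type="url" name="website_url" id="website_url" required>
    <input type="submit" name="submit" value="Submit">
</form>
HTML;
}

echo renderScrapeForm();
```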


Step 2: Create a PHP function to get the website data:

The second step is to create a PHP function in the scrape.php file that fetches the data using the PHP cURL library. cURL also lets you connect to and communicate with different servers over different protocols without difficulty.

function scrapeWebsiteData($website_url) {
    // Stop early if the cURL extension is not available.
    if (!function_exists('curl_init')) {
        die('cURL is not installed. Please install and try again.');
    }
    // Start a new cURL session.
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $website_url);
    // Return the response as a string instead of printing it.
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($curl);
    curl_close($curl);
    return $output;
}

Here, the function first checks whether PHP cURL is installed. Three main cURL functions are used in the function body: curl_init() initializes the session, curl_exec() executes it, and curl_close() closes the connection. Options such as CURLOPT_URL set the URL of the website we need to scrape. The second option, CURLOPT_RETURNTRANSFER, makes curl_exec() return the fetched page as a string instead of using the default behavior, which prints the whole page directly.
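
The same three calls can be combined with basic failure handling. The following is a sketch under stated assumptions, not the article's code: the function name `fetchPage`, the 10-second default timeout, and the choice to return `null` on failure are all illustrative. `CURLOPT_FOLLOWLOCATION`, `CURLOPT_TIMEOUT`, and `curl_getinfo()` with `CURLINFO_HTTP_CODE` are standard parts of PHP's cURL extension.

```php
<?php
// fetchPage() is a hypothetical variant of the article's function.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    if (!function_exists('curl_init')) {
        throw new RuntimeException('The cURL extension is not installed.');
    }

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);     // return the body as a string
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);     // follow HTTP redirects
    curl_setopt($curl, CURLOPT_TIMEOUT, $timeoutSeconds); // give up after this many seconds

    $body = curl_exec($curl);                             // false on transport failure
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);    // 0 if no HTTP response arrived
    curl_close($curl);

    // Treat transport failures and HTTP error statuses as "no page".
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

Returning `null` lets the caller tell a failed request from an empty page, which the plain `curl_exec()` result (`false` or a string) makes easy to overlook.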

Step 3: Extract specific data from the website:

Now it is time to call the function from your PHP file and scrape a specific section of the web page. If you do not want all the data from a given URL, you can cut out only the part you need: because CURLOPT_RETURNTRANSFER makes the page come back as a string, you can locate the start and end of the section with strpos() and extract it with substr().

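if (isset($_POST['submit'])) {
    $html = scrapeWebsiteData($_POST['website_url']);
    // Start marker: text that appears on the target page; adjust it for your site.
    $start_point = strpos($html, 'Latest Posts');
    // End marker: the original value was lost in this copy; '</div>' is a placeholder assumption.
    $end_point = strpos($html, '</div>', $start_point);
    $length = $end_point - $start_point;
    // Keep only the text between the two markers.
    $html = substr($html, $start_point, $length);
    echo $html;
}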
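
The snippet above assumes both markers are present on the page. Since `strpos()` returns `false` when a marker is missing, a slightly more defensive version can be sketched as below. The helper name `extractBetween` and the sample markup are assumptions introduced only for this example.

```php
<?php
// extractBetween() is a hypothetical helper name introduced for this sketch.
function extractBetween(string $html, string $startMarker, string $endMarker): ?string
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null; // start marker is not on the page
    }
    // Look for the end marker only after the start marker.
    $end = strpos($html, $endMarker, $start + strlen($startMarker));
    if ($end === false) {
        return null; // end marker is not on the page
    }
    // The result includes the start marker and stops just before the end marker.
    return substr($html, $start, $end - $start);
}

// Hypothetical page content for demonstration.
$page = '<div><h3>Latest Posts</h3><ul><li>One</li></ul></div><footer>x</footer>';
echo extractBetween($page, '<h3>Latest Posts</h3>', '</div>');
```

For anything beyond a single, well-delimited section, the DOM extension mentioned at the start of the article is usually less brittle than string offsets.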

We recommend that you build a basic knowledge of PHP and regular expressions before using this code or scraping a blog or website for personal purposes.

December 8, 2017