Anonymous · 2023.07.15 · 44,455 views
Website crawler using PHP and gzip (crawling program)
<p>Preview: <del>no longer supported</del></p>
<p><br></p>
<p>The program crawls the images, HTML, CSS, JS, and every link of the website you enter, and saves the crawled links to link.txt.</p>
<p>Please keep the "Produced by Tak2" credit in place if at all possible.<br></p>
<p><br></p>
<p>So that the crawler honors each site's robots.txt and crawls legitimately, replace</p>
<p><br></p>
<p>&nbsp; &nbsp; function getDomain($url) {</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; $parsedUrl = parse_url($url);</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; return $parsedUrl['scheme'] . '://' . $parsedUrl['host'];</p>
<p>&nbsp; &nbsp; }</p>
<p><br></p>
<p>with</p>
<p><br></p>
<p>function isCrawlingAllowed($url) {</p>
<p>&nbsp; &nbsp; $parsedUrl = parse_url($url);</p>
<p>&nbsp; &nbsp; $robotsUrl = $parsedUrl['scheme'] . '://' . $parsedUrl['host'] . '/robots.txt';</p>
<p><br></p>
<p>&nbsp; &nbsp; $robotsContent = @file_get_contents($robotsUrl);</p>
<p>&nbsp; &nbsp; if ($robotsContent === false) {</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; return true; // Allow crawling when robots.txt cannot be fetched</p>
<p>&nbsp; &nbsp; }</p>
<p><br></p>
<p>&nbsp; &nbsp; $allow = true;</p>
<p>&nbsp; &nbsp; $disallowPaths = array();</p>
<p>&nbsp; &nbsp; $lines = explode("\n", $robotsContent);</p>
<p>&nbsp; &nbsp; foreach ($lines as $line) {</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; $line = trim($line); // Strip "\r" and surrounding whitespace</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; // Field names are case-insensitive; rules for every User-agent are collected</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; if (stripos($line, 'Disallow:') === 0) {</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $disallowPath = trim(substr($line, strlen('Disallow:')));</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if ($disallowPath !== '') {</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $disallowPaths[] = $disallowPath;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; }</p>
<p>&nbsp; &nbsp; }</p>
<p><br></p>
<p>&nbsp; &nbsp; // Check whether the requested path starts with one of the Disallow paths</p>
<p>&nbsp; &nbsp; $urlPath = isset($parsedUrl['path']) ? $parsedUrl['path'] : '/';</p>
<p>&nbsp; &nbsp; foreach ($disallowPaths as $path) {</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; if (strpos($urlPath, $path) === 0) {</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $allow = false;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; break;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; }</p>
<p>&nbsp; &nbsp; }</p>
<p><br></p>
<p>&nbsp; &nbsp; return $allow;</p>
<p>}</p>
<p><br></p>
<p>*Updated version: <a href="https://dsclub.kr/bbs/board.php?bo_table=code&amp;wr_id=297" style="font-family: &quot;Helvetica Neue&quot;, sans-serif; font-size: 13px; background-color: rgb(255, 255, 255);">https://dsclub.kr/bbs/board.php?bo_table=code&amp;wr_id=297</a></p>
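<p>The robots.txt check above treats every Disallow line the same, whichever User-agent it belongs to. As a rough sketch of how the rules could be narrowed to one crawler, the helper below parses robots.txt text that has already been downloaded and only honors groups whose User-agent matches. The name isPathDisallowed and its parameters are hypothetical and not part of the original crawler, and the sketch deliberately ignores Allow lines, wildcards, and Crawl-delay.</p>

```php
<?php
// Hypothetical helper (an assumption, not the original author's code):
// returns true when $path is blocked for $userAgent by the given robots.txt text.
function isPathDisallowed(string $robotsContent, string $path, string $userAgent = '*'): bool
{
    $applies = false;      // does the current group apply to our user agent?
    $inAgentLines = false; // are we inside a run of consecutive User-agent lines?

    foreach (preg_split('/\r\n|\r|\n/', $robotsContent) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue; // skip blank or malformed lines
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // A User-agent line that follows a rule line starts a new group.
            if (!$inAgentLines) {
                $applies = false;
            }
            $inAgentLines = true;
            if ($value === '*' || ($value !== '' && stripos($userAgent, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            // Simple prefix match on the path; an empty Disallow blocks nothing.
            if ($field === 'disallow' && $applies && $value !== ''
                && strpos($path, $value) === 0) {
                return true;
            }
        }
    }
    return false;
}

// Usage with an inline robots.txt body, so no network access is needed.
$sample = "User-agent: *\nDisallow: /private/\n";
var_dump(isPathDisallowed($sample, '/private/a.html')); // bool(true)
var_dump(isPathDisallowed($sample, '/index.html'));     // bool(false)
```

<p>Keeping the parsing separate from the download makes the rule matching easy to try out with a fixed string before wiring it into the crawler.</p>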