Paper

2012

Extraction of relevant components using shallow structure of HTML documents

Proceedings - 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012
  • Jun Zeng
  • Brendan Flanagan
  • Toshihiko Sakai
  • Sachio Hirokawa

First page
1186
Last page
1190
Language
Publication type
Research paper (international conference proceedings)
DOI
10.1109/FSKD.2012.6234295
Publisher
IEEE

As the number of web pages increases, searching semi-structured documents is gaining greater attention. The traditional approach to extracting data from web page documents is to write specialized programs, called wrappers, that identify data of interest and map them to a suitable format. However, developing wrappers manually has many well-known shortcomings, mainly due to the difficulty of writing and maintaining them for continually changing web data. Moreover, no single wrapper program can handle all kinds of web pages. In this paper, we aim to extract relevant and meaningful snippets from as many web pages as possible, using the shallow features of HTML documents to discover and analyze the relevant components. We also introduce a new feature called GAP and verify its effectiveness by conducting an SVM learning experiment. © 2012 IEEE.

Links
DOI
https://doi.org/10.1109/FSKD.2012.6234295
DBLP
https://dblp.uni-trier.de/rec/conf/fskd/ZengFSH12
Scopus
https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84872946925&origin=inward
Scopus Citedby
https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=84872946925&origin=inward
URL
https://dblp.uni-trier.de/conf/fskd/2012
URL
https://dblp.uni-trier.de/db/conf/fskd/fskd2012.html#ZengFSH12
ID information
  • DOI : 10.1109/FSKD.2012.6234295
  • DBLP ID : conf/fskd/ZengFSH12
  • SCOPUS ID : 84872946925
