html (string), articleBody (string), url (string)
"<html itemscope=\"\" itemtype=\"http://schema.org/WebPage\" xmlns=\"http://www.w3.org/1999/xhtml\" (...TRUNCATED)
"Gaming used to be so simple. We’d buy a game, sit down in front of a console or PC, grind our way(...TRUNCATED)
"https://www.wsj.com/articles/google-stadia-microsoft-xcloud-apple-arcade-so-many-ways-to-playand-pa(...TRUNCATED)
"<html lang=\"en-US\" itemid=\"https://www.nytimes.com/2019/11/19/opinion/republicans-elections-impe(...TRUNCATED)
"Americans have gone to the polls four times this month to vote in major, statewide races. In Virgin(...TRUNCATED)
"https://www.nytimes.com/2019/11/19/opinion/republicans-elections-impeachment.html"
"<html class=\"header-spacing spacing\"><head><!-- hearst/home/header_main.tpl -->\n\n <!-- e(...TRUNCATED)
"New electric vehicles, several new small SUVs, a redesigned compact car, a plug-in version of Toyot(...TRUNCATED)
"https://www.ctpost.com/news/us/article/New-SUVs-and-electric-vehicles-highlight-L-A-14848164.php"
"<html class=\"no-js\" lang=\"en-US\" xmlns:og=\"http://opengraphprotocol.org/schema/\" xmlns:fb=\"h(...TRUNCATED)
"(Reuters) — The New York State Attorney General (NYAG) is investigating WeWork, according to two (...TRUNCATED)
"https://venturebeat.com/2019/11/18/new-york-state-attorney-general-investigating-wework-and-former-(...TRUNCATED)
"<html lang=\"en-US\"><head>\n<meta charset=\"UTF-8\">\n<meta http-equiv=\"X-UA-Compatible\" content(...TRUNCATED)
"Volkswagen’s first ID.3 all-electric car based on the new MEB platform isn’t expected until nex(...TRUNCATED)
"https://www.slashgear.com/the-vw-id-space-vizzion-is-a-weird-ev-sports-wagon-with-a-secret-message-(...TRUNCATED)
"<html lang=\"en\"><head>\n <meta id=\"viewport\" name=\"viewport\" content=\"width=device-width,(...TRUNCATED)
"In case you are living in Delhi-NCR, chances are you have an app or two to find out the latest air (...TRUNCATED)
"https://www.newsnation.in/fact-check/news/fact-check-is-an-oxygen-bar-in-delhi-offering-fresh-air-f(...TRUNCATED)
"<html lang=\"en\" xmlns:og=\"http://opengraphprotocol.org/schema/\" xmlns:fb=\"http://ogp.me/ns/fb#(...TRUNCATED)
"The Steelers spent Monday trying to distance themselves from Thursday night's fight that led to mul(...TRUNCATED)
"https://www.cbssports.com/nfl/news/browns-player-on-mason-rudolphs-role-in-fight-with-myles-garrett(...TRUNCATED)
"<html class=\"ArticlePage\" lang=\"en-US\"><head>\n <meta charset=\"UTF-8\">\n\n <style data-(...TRUNCATED)
"Walt Disney Co. executive Kevin Mayer said overwhelming demand and a computer-coding glitch led to (...TRUNCATED)
"https://www.latimes.com/entertainment-arts/business/story/2019-11-19/disney-plus-kevin-mayer"
"<html data-ticker-type=\"sport\" data-display-leagues=\"\" data-display-days=\"7\" lang=\"en\"><hea(...TRUNCATED)
"MADRID — Rafael Nadal kept Spain’s hopes alive, then Marcel Granollers and Feliciano Lopez comp(...TRUNCATED)
"https://www.sportsnet.ca/tennis/argentina-comfortably-wins-davis-cup-opener-chile/"
"<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en-US\" xmlns:fb=\"http://ogp.me/ns/fb#\" prefi(...TRUNCATED)
"Senator representing Yobe North , Ahmad Lawan , on Tuesday moved a motion for the adjournment of th(...TRUNCATED)
"http://www.theparadigmng.com/2018/10/09/breaking-lawan-moves-motion-senates-adjournment-nzeribe-ade(...TRUNCATED)

Scrapinghub Article Extraction Benchmark

This dataset was originally created by Scrapinghub and distributed under the MIT License on GitHub: github.com/scrapinghub/article-extraction-benchmark

It is mirrored on the Hugging Face Hub as a convenience.
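
As a rough illustration of the record layout shown in the preview, each row carries three string columns: html, articleBody, and url. The sketch below wraps one such row in a small Python type. The names ArticleRecord and record_from_row and the toy row are hypothetical and are not part of the dataset or of any Scrapinghub tooling. How the rows are actually loaded is left open here.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Column names as they appear in the dataset preview.
COLUMNS = ("html", "articleBody", "url")


@dataclass(frozen=True)
class ArticleRecord:
    """One benchmark row (hypothetical helper type, not from the dataset)."""
    html: str          # raw page HTML
    article_body: str  # reference article text
    url: str           # source page URL

    @property
    def domain(self) -> str:
        # Network location of the source URL, e.g. "example.com".
        return urlparse(self.url).netloc


def record_from_row(row: dict) -> ArticleRecord:
    """Build a record from a dict keyed by the preview's column names."""
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        raise KeyError(f"row is missing columns: {missing}")
    return ArticleRecord(
        html=row["html"],
        article_body=row["articleBody"],
        url=row["url"],
    )


# Toy row for illustration only; real rows hold full pages and articles.
example = record_from_row({
    "html": "<html><body><p>Hello world.</p></body></html>",
    "articleBody": "Hello world.",
    "url": "https://example.com/articles/hello",
})
print(example.domain)
```

Grouping records by domain in this way may be handy when checking how an extractor behaves across the different news sites in the benchmark.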
