Dataset Preview

Columns (name: type), in the order used by each row below, one value per line:
url: string
redirects: int64
not_indexed_by_google: int64
issuer: string
certificate_age: int64
email_submission: int64
request_url_percentage: float64
url_anchor_percentage: float64
meta_percentage: float64
script_percentage: float64
link_percentage: float64
mouseover_changes: int64
right_click_disabled: int64
popup_window_has_text_field: int64
use_iframe: int64
has_suspicious_port: int64
external_favicons: int64
TTL: int64
ip_address_count: int64
TXT_record: int64
check_sfh: float64
count_domain_occurrences: int64
domain_registration_length: int64
abnormal_url: int64
age_of_domain: int64
is_malicious: float64
page_rank_decimal: float64
"http://www.niedziela.pl/artykul/39133/eksperci-apeluja-do-polskich-wladz-zlobki"
1
0
null
0
0
0
0.592179
0
0.424242
0.575758
0
0
0
0
0
0
30
1
0
0
189
0
0
-1
0
5.07
"http://www.exquisitedesires.com/~xerge/6e3bfba368e4e411f0ea467231c8567a/index.php"
1
0
null
0
0
0
0
0
0.5
0.5
0
0
0
0
1
0
30
2
0
0.5
0
0
0
-1
1
2.77
"http://afex.biz/gmail_verificar/ServiceLoginAuth/fwd/"
0
0
null
0
0
0
0
0
0.666667
0.333333
0
0
0
0
1
1
30
2
0
0
0
0
0
-1
1
null
"https://asmbs.org/chapters/virginia"
0
0
"US"
-88
0
0
0
0
0.35
0.65
0
0
0
0
0
0
30
1
0
0
221
0
0
-1
0
5.14
"http://www.whathifi.com/canton/dm100/review"
1
0
null
0
0
0
0
0
0.333333
0.666667
0
0
0
0
1
2
5
1
0
0
80
0
0
-1
0
5.47
"https://www.tdsb.on.ca/portals/_default/upcoming_event/guest%20speaker%20-tess_paye_febuary%202019.pdf"
0
0
"US"
-145
0
1
0
0
0
0
0
0
0
0
1
0
30
1
0
0
0
0
0
-1
0
5.25
"http://www.the-linde-group.com/de/corporate_responsibility/employees_and_society/competing_for_talent/index.html"
1
0
null
0
0
0
0
0
0.294118
0.705882
0
0
0
1
1
1
20
2
0
0
0
0
0
-1
0
4.63
"https://homesmart.com/real-estate-agent/california/palmdesert/44759-thomas-tucker/smart_partners"
0
0
"US"
-316
0
0
0.416667
0
0.622222
0.377778
0
0
0
1
1
1
30
3
0
0
15
0
0
-1
0
5.36
"https://www.books-sanseido.co.jp/events/538935/%e6%a3%ae%e8%a6%8b%e3%81%95%e3%82%93%e8%bf%91%e5%bd%b1"
0
0
"JP"
-266
0
0
0
0
0.583333
0.416667
0
0
0
0
0
0
30
1
0
0.5
37
0
0
-1
0
4.46
"http://tinyurl.com/zkxg4le"
1
0
null
0
0
0
0
0
0.25
0.75
0
0
0
1
1
0
30
3
0
0
4
0
0
-1
1
8.17
"https://www.finalsite.com/design/portfolio/~board/portfolio-2018/post/american-school-of-bombay"
1
0
"US"
-284
0
0
0
0
0.615385
0.384615
0
0
0
1
1
1
30
5
0
0
7
0
0
-1
0
4.94
"https://www.nationalenquirer.com/photos/lisa-marie-presley-broke-scandal/"
0
0
"US"
-318
0
0
0.25
0
0.266667
0.733333
0
0
0
1
1
5
30
4
0
0.5
141
0
0
-1
0
5.03
"https://themes4wp.com/contact/"
0
0
"US"
-32
0
0
0.976744
0
0.361111
0.638889
0
0
0
0
1
0
30
2
0
0
127
0
0
-1
0
4.91
"http://www.naphill.org/product-category/napoleon-hill-classics/?product_orderby=date"
0
0
null
0
0
0
0
0
0.689922
0.310078
0
0
0
1
1
0
30
1
0
0
390
0
0
-1
0
5.1
"http://www.dominios.com.co/buscar/?tld=.brujasdecartagena.com.co&domain=www"
1
0
null
0
0
0
0
0
0.681818
0.318182
0
0
0
1
1
1
30
2
0
0
0
0
0
-1
0
3.4
"http://ch.ai/2010/02/06/cheeky-yorkshire-tea-commercial-from-the-uk/"
0
0
null
0
0
0
0.068182
0
0.307692
0.692308
0
0
0
0
0
0
30
1
0
0
134
0
0
-1
0
2.89
"https://www.cweonline.org/about-cwe/cwe-eastern-massachusetts/grow"
1
0
"US"
-83
1
0
0
0
0.8
0.2
0
0
0
0
1
0
30
1
0
0.5
155
0
0
-1
0
4.78
"https://www.oswego.edu/atmospheric-geological-sciences/opportunities"
0
0
"BE"
-390
0
0
0
0
0.434783
0.565217
0
0
0
1
1
1
30
1
0
0
138
0
0
-1
0
5.23
"https://wild-about-travel.com/oceania/"
1
0
"US"
-60
0
0
0
0
0.333333
0.666667
0
0
0
1
1
0
30
2
0
1
224
0
0
-1
0
4.3
"https://www.leylobby.gob.cl/instituciones/mu239/cargos-pasivos/133528/donativos"
0
0
"US"
-163
0
0
0.066667
0
0.666667
0.333333
0
0
0
0
1
0
29
3
0
0
29
0
0
-1
0
3.99
"http://livebuzz.co.uk/visa/[email protected]?done=1"
3
0
null
0
0
0
0
0
0.472222
0.527778
0
0
0
0
1
5
30
2
0
0
7
0
0
-1
1
3.21
"https://www.youngandreckless.com/products/veronique-tie-back-bikini-top"
1
0
"US"
-47
0
0
0
0
0.625
0.375
0
0
0
0
1
0
30
1
0
0.5
339
0
0
-1
0
4.5
"https://www.webhostingpad.com/awards/"
0
0
"GB"
-343
0
0
0
0
0.5
0.5
0
0
0
0
0
1
30
1
0
0
1
0
0
-1
0
5.05
"https://www.glomarr.com/news/happy-april"
0
0
"US"
-66
0
0
0
0
0.411765
0.588235
0
0
0
0
1
0
30
1
0
0
13
0
0
-1
0
3.63
"https://www.ericmmartin.com/donate/"
0
0
"US"
-85
0
0
0
0
0.333333
0.666667
0
0
0
0
1
1
30
1
0
0.5
77
0
0
-1
0
5.28
"https://www.gotokyo.org/en/spot/856/"
0
0
"BE"
-176
0
0
0.16092
0
0.428571
0.571429
0
0
0
1
0
1
30
1
0
0
10
0
0
-1
0
5.33
"http://www.librairiedialogues.fr/livre/4098802-voir-les-champignons-spooner-brian-flammarion?affiliate=d_flammarion"
2
0
null
0
0
0
0.232143
0
0.6
0.4
0
0
0
1
0
0
29
1
0
0.5
10
0
0
-1
0
4.61
"http://www.chinaqw.com/hqhr/2019/02-08/214934.shtml"
0
0
null
0
0
0
0
0
0.75
0.25
0
0
0
0
1
0
29
1
0
0
36
0
0
-1
0
5.07
"http://www.calit2.net/people/detail.php?id=276"
0
0
null
0
0
0
0.058824
0
0.5
0.5
0
0
0
0
0
0
30
1
0
0
6
0
0
-1
0
5.04
"http://aleberth.addr.com/x2yZ1scde/webscr_prim.php?YWxlYmVydGguYWRkci5jb20=uhsdsusu5485757kUJHNN546221oPLKj988777AOP784MTM0Njg1MzUyNQ="
0
0
null
0
0
0
0.916667
0
0.75
0.25
0
0
0
0
1
1
30
1
0
0
12
0
0
-1
1
null
"https://www.seetickets.us/event/the-sonics-w-sailor-poon-and-the-hickoids/366431"
3
0
"US"
-294
0
0
0.827586
0
0.782609
0.217391
0
0
0
1
1
1
28
3
0
0
7
0
0
-1
0
5.12
"http://www.play-well.org/about-lego-birthday-parties.shtml"
1
0
null
0
0
0
0
0
0.764706
0.235294
0
0
0
0
0
0
30
1
0
0
8
0
0
-1
0
4.21
"https://mises.org/es/search/site/author/node%3a1194/library/interviews-367"
1
0
"US"
-40
0
0
0
0
0.357143
0.642857
0
0
0
1
1
0
30
1
0
0.5
13
0
0
-1
0
5.72
"https://www.take-a-screenshot.org/en/about.html"
0
0
"US"
-51
0
0
0
0
1
0
0
0
0
0
1
0
30
1
0
0
0
0
0
-1
0
4.88
"http://thornbridgebrewery.com/pp/[email protected]/it/webapps/mpp/home#"
2
0
null
0
0
0
0
0
0.240506
0.759494
0
0
0
1
1
1
30
1
0
0.5
0
0
0
-1
1
4.35
"https://performancein.com/news/2014/11/06/how-retailers-can-best-use-data-preperation-cyber-monday/"
0
0
"US"
-58
0
0
0.213675
0
0.4
0.6
0
0
0
1
1
3
30
1
0
0
219
0
0
-1
0
5.01
"https://www.theawl.com/2013/12/topless-geraldo/"
0
0
"US"
-71
0
0
0.370968
0
0.25
0.75
0
0
0
0
0
0
30
1
0
0.5
60
0
0
-1
0
5.57
"https://www.skeptic.com/eskeptic/07-08-22/"
0
0
"US"
-62
0
0
0.263566
0
0.263158
0.736842
0
0
0
0
1
2
30
2
0
0
71
0
0
-1
0
5.33
"https://www.intelliprice.com/intellipricedealer/start.htm?dealerid=1141003&dealerpacode=09801&secondaryleadsource=elite%20website&secondaryid=desktop&primaryleadsource=dc%20trade-in&vendorbrand=ford"
0
0
"US"
-116
0
1
0
0
1
0
0
0
0
1
1
0
30
4
0
0
0
0
0
-1
0
4.04
"http://pandawhale.com/post/8014/the-very-best-grumpy-cat-gifs"
0
0
null
0
0
1
0
0
0
0
0
0
0
0
0
0
29
1
0
0
0
0
0
-1
0
4.8
"http://consolidatedfuneralservices.com/"
3
0
null
0
0
0
0
0
0.333333
0.666667
0
0
0
1
1
1
30
2
0
0
0
0
0
-1
0
3.25
"https://www.fyber.com/announcements/fyberfalkpressrelease.pdf"
1
0
"US"
-286
0
0
0.24359
0
0.263158
0.736842
0
0
0
1
1
3
29
2
0
0
0
0
0
-1
0
5.25
"https://bit.parts/entry/categories/blog?start=15"
0
0
"US"
-50
0
0
0
0
0.255319
0.744681
0
0
0
1
0
5
29
1
0
0
369
0
0
-1
0
3.98
"http://www.alltop.com/science"
3
0
null
0
0
0
0
0
0.5
0.5
0
0
0
1
1
2
29
2
0
0
3
0
0
-1
0
6.6
"https://www.lexblog.com/2018/10/11/cma-study-into-statutory-audits/"
0
0
"US"
-283
0
0
0
0
0.454545
0.545455
0
0
0
1
1
0
28
3
0
0
129
0
0
-1
0
4.88
"https://www.mst.edu/~matadvan"
3
0
"US"
-198
0
1
0
0
0
1
0
0
0
0
1
0
30
1
0
0
0
0
0
-1
0
5.19
"https://www.tidyverse.org/articles/2018/01/tibble-1-4-2/"
1
0
"US"
-34
0
0
0
0.066667
0.333333
0.6
0
0
0
0
1
4
19
2
0
0
3
0
0
-1
0
5.19
"https://www.gerritcodereview.com/"
0
0
"US"
-90
0
0
0
0
0.4375
0.5625
0
0
0
0
1
1
30
2
0
0
5
0
0
-1
0
5
"https://eventregist.com/p/mashingup?lang=th_th"
0
0
"US"
-200
0
0
0
0
0.608696
0.391304
0
0
0
0
1
1
30
4
0
0.5
16
0
0
-1
0
6.31
"https://petergreenberg.com/2017/08/07/luxe-lavs-new-hotels/"
0
0
"US"
-307
0
0
0.159664
0
0.479167
0.520833
0
0
0
0
1
0
30
1
0
0
499
0
0
-1
0
5.07
"http://bradfello.ws/wp-admin/includes/www.alibaba.com/login.jsp.htm"
0
0
null
0
0
0
0
0
0
0
0
0
0
0
1
0
30
1
0
0
3
0
0
-1
1
null
"http://lbpol.postedi.com/index.php?MfcISAPICommand=SignInFPP&amp"
1
0
null
0
0
0
0
0
0.5
0.5
0
0
0
0
1
0
30
2
0
0.5
0
0
0
-1
1
null
"http://minecraftm.com/tag/minecraft-appdata/"
0
0
null
0
0
0
0.083333
0
0.32
0.68
0
0
0
0
0
0
30
1
0
0
44
0
0
-1
0
5
"https://www.nrcdv.org/rhydvtoolkit/common-ground/"
0
0
"US"
-65
0
0
0.142857
0
0.333333
0.666667
0
0
0
0
0
0
30
1
0
0
15
0
0
-1
0
5.09
"https://www.goodgallery.com/faq-items/canonical-tags/"
0
0
"GB"
-105
0
1
0
0
0.739726
0.260274
0
0
0
0
1
4
30
2
0
0
3
0
0
-1
0
4.24
"http://www3.vwa.nl/foto/ambrosia_foto_nvwa_nummer_2.jpg"
0
0
null
0
0
1
0
0
0
0
0
0
0
0
1
0
30
1
0
0
0
0
0
-1
0
3.12
"https://www.rushessay.com/our_process.php"
0
0
"US"
-11
0
0
0
0
0.461538
0.538462
0
0
0
1
1
1
30
1
0
0.5
2
0
0
-1
0
4.7
"https://www.exchangewire.com/blog/category/header-bidding/page/2/"
0
0
"GB"
-36
0
0
0
0
0.208333
0.791667
0
0
0
1
0
0
30
1
0
0
162
0
0
-1
0
5.23
"http://www.ocert.org/patches/exslt_crypt.patch"
1
0
null
0
0
1
0
0
0
0
0
0
0
0
1
0
30
4
0
0
0
0
0
-1
0
4.94
"https://metromode.se/sitemap-pt-horoskop-2017-05.html"
0
0
"US"
-297
0
0
0.846154
0
0
0
0
0
0
0
1
0
30
3
0
0
8
0
0
-1
0
5.25
"http://www.saeima.lv/lv/likumdosana/saeimas-sedes"
1
0
null
0
0
0
0
0
0.526316
0.473684
0
0
0
0
0
2
27
1
0
0.5
7
0
0
-1
0
5.24
"https://www.apec.org/publications/2012/08/marine-microorganisms-capacity-building-for-a-broader-cooperative-research-and-utilization"
0
0
"US"
-309
0
0
0
0
0.533333
0.466667
0
0
0
1
1
2
4
3
0
0.5
3
0
0
-1
0
5.59
"http://www.xuetangx.com/courses/course-v1:tsinghuax+00690212x-2+2017_t2/about"
2
0
null
0
0
1
0
0
0.333333
0.666667
0
0
0
0
1
2
8
3
0
0
2
0
0
-1
0
4.63
"https://relevantmagazine.com/god/church/what-christians-get-wrong-about-easter-story"
3
0
"US"
-74
0
0
0.08427
0
0.4375
0.5625
0
0
0
1
1
0
28
2
0
0
806
0
0
-1
0
5.34
"http://www.thecassisbistro.ca/"
2
0
null
0
0
1
0
0
0.428571
0.571429
0
0
0
0
1
0
30
2
0
0.5
0
0
0
-1
0
4.14
"https://wannwowie.de/"
2
0
"GB"
-287
0
0
0
0
0.25
0.75
0
0
0
0
1
3
26
1
0
0.5
0
0
0
-1
1
0
"https://x3dom.org/docs-old/genindex.html"
0
0
"US"
-38
0
0
0.018182
0
0.5
0.5
0
0
0
0
0
0
30
1
0
0
1
0
0
-1
0
5.02
"http://198.57.247.160/~karim12/service/costumer/information/check/93cf21592dd57cdbc97d3906fc8da94a/index/web/4190f2b8546f3f74e6c3636e349ba086/login.php"
1
0
null
0
0
1
0
0
0
0
0
0
0
0
0
0
5
1
0
0
0
0
1
-1
1
null
"http://mark3d.com/yale/login.htm"
3
0
null
0
0
0
0
0
0.681159
0.318841
0
0
0
0
1
3
30
1
0
0.5
593
0
0
-1
1
4.23
"http://www.haiwainet.cn/n/2019/0121/c3543950-31484352.html"
0
0
null
0
0
0
0.939024
0
0.7
0.3
0
0
0
0
1
0
30
10
0
0
15
0
0
-1
0
4.04
"https://www.webland.ch/de-ch/hosting/optionen"
1
0
"GB"
-306
0
0
0
0
0.470588
0.529412
0
0
0
1
0
4
28
1
0
1
0
0
0
-1
0
3.57
"https://thetrustproject.org/trust-project-receives-funding-to-develop-trust-in-media/"
1
0
"US"
-343
0
0
0
0
0.366667
0.633333
0
0
0
1
0
0
29
1
0
0
126
0
0
-1
0
5.15
"http://creativecommons.org.au/blog/2009/01/opinionated-volunteers-wanted/"
1
0
null
0
0
0
0
0
0.183333
0.816667
0
0
0
0
1
6
30
2
0
0
0
0
0
-1
0
5.37
"http://herpaderpus.0fees.net/default.php?a=login"
0
0
null
0
0
1
0
0
0.333333
0.666667
0
0
0
0
1
1
30
1
0
0
0
0
0
-1
1
null
"https://www.back40design.com/bigcommerce"
1
0
"US"
-66
0
0
0
0
0.277778
0.722222
0
0
0
1
1
0
30
1
0
0
117
0
0
-1
0
4.07
"https://www.blueletterbible.org/esv/eph/3/11/s_1100011"
0
0
"US"
-38
0
0
0
0
0.557377
0.442623
0
0
0
0
0
5
30
1
0
0.5
5
0
0
-1
0
5.64
"https://www.sharp.com/health-classes/bls-for-health-care-providers-class-or-renewal-6/section-30949"
1
0
"US"
-242
0
0
0
0
0.5
0.5
0
0
0
1
1
1
30
1
0
0
20
0
0
-1
0
5.11
"https://www.relx.com/site-services/terms-and-conditions"
0
0
"BE"
-337
1
0
0
0
0.25
0.75
0
0
0
1
1
1
20
2
0
0.5
8
0
0
-1
0
5.24
"https://www.vtinfo.com/pf/product_finder.asp?custid=gre"
1
0
"US"
-333
0
0
0.5
0
0.6
0.4
0
0
0
0
1
1
19
2
0
0
1
0
0
-1
0
3.52
"http://www.artlebedev.com/stoloto/rapidoloto/"
1
0
null
0
0
0
0
0
0.5
0.5
0
0
0
1
0
0
29
1
0
0
2
0
0
-1
0
5.29
"https://www.m247.ro/ro/despre/echipa/mike-darcey-ro/"
1
0
"US"
-48
0
0
0
0
0.066667
0.933333
0
0
0
0
1
2
28
2
0
0
3
0
0
-1
0
3.54
"http://game.com/?utm_source=inp.one"
1
0
null
0
0
1
0
0
0.571429
0.428571
0
0
0
0
0
1
30
1
0
0
4
0
0
-1
0
4.95
"https://www.blazemeter.com/script-creation"
1
0
"US"
-76
0
0
0
0
0.241379
0.758621
0
0
0
1
1
1
30
1
0
0
4
0
0
-1
0
5.03
"https://ad.gt/buttons"
0
0
"US"
-305
0
0
0
0
1
0
0
0
0
0
1
0
29
3
0
0
4
0
0
-1
0
3.97
"https://www.nicelabel.com/es/product-selector?uid=11287"
3
0
"US"
-106
0
0
0
0
0.217391
0.782609
0
0
0
1
0
2
28
1
0
1
6
0
0
-1
0
4.29
"https://www.goguardian.com/newsroom.html"
1
0
"US"
-16
0
0
0
0
0.5
0.5
0
0
0
1
1
1
19
3
0
0
3
0
0
-1
0
5
"https://www.amazon.in/printtech-pattern-huawei-enhanced-dual-sim/dp/b01m0g9vpy?subscriptionid=akiai46kfzlxd4qvnu7a"
0
0
"US"
-102
0
0
0
0
0
1
0
0
0
0
1
0
30
1
0
0
2
0
0
-1
0
6.45
"https://flutter.io/docs/get-started/install/macos"
2
0
"US"
-44
0
0
0.165577
0
0.565217
0.434783
0
0
0
1
1
1
29
1
0
0
0
0
0
-1
0
5.27
"https://www.spandidos-publications.com/pages/mmr/abstracting"
0
0
"US"
-292
0
0
0
0
0.769231
0.230769
0
0
0
0
1
0
29
1
0
0
2
0
0
-1
0
5.31
"https://margauxny.com/products/the-demi-black-navy"
0
0
"US"
-255
0
0
0
0
0.380952
0.619048
0
0
0
1
1
0
29
2
0
0.5
420
0
0
-1
0
4.83
"https://flythemes.net/forums/topic/vacation-lite-remove-comment-tag/"
0
0
"US"
-39
0
0
0.041667
0
0.37931
0.62069
0
0
0
1
0
0
30
1
0
0
166
0
0
-1
0
5.29
"http://www.tes.co.uk/mypublicprofile.aspx?uc=932747&event=21"
4
0
null
0
0
0
0
0
0.7
0.3
0
0
0
0
1
1
29
3
0
0.5
0
0
0
-1
0
5.32
"http://www.journalonweb.com/meajo/forgotpass.asp"
1
0
null
0
0
1
0
0
0.5
0.5
0
0
0
0
1
0
30
2
0
0
0
0
0
-1
0
4.25
"http://www.geautomation.com/de/download/pacsystems-hochverf%c3%bcgbare-l%c3%b6sungen"
4
0
null
0
0
0
0
0
0.636364
0.363636
0
0
0
1
1
1
29
4
0
0
0
0
0
-1
0
4.14
"http://www.dpa.org.nz/news/preliminary-notice-dpa-agm-2018"
1
0
null
0
0
0
0
0
0.5
0.5
0
0
0
0
0
0
25
1
0
0
7
0
0
-1
0
4.88
"https://www.urbanbound.com/blog/4-questions-you-probably-have-about-relocation-tax-gross-ups"
1
0
"US"
-47
0
0
0.180556
0
0.75
0.25
0
0
0
1
1
0
30
2
0
0
99
0
0
-1
0
4.7
"https://bit.parts/?start=40"
0
0
"US"
-50
0
0
0
0
0.26087
0.73913
0
0
0
0
0
5
29
1
0
0
63
0
0
-1
0
3.98
"https://www.mecum.com/lots/lv0118-315122/1908-indian-single-board-track-racer/"
1
0
"US"
-82
0
0
0
0
0.4
0.6
0
0
0
0
1
3
30
2
0
0
66
0
0
-1
0
5.51
"https://www.jcvi.org/cms/research/past-projects/cmr/overview/?page=cmr_search&search_type=cog&crumbs=searches"
1
0
"US"
-105
0
0
0
0
0.142857
0.857143
0
0
0
1
1
4
30
1
0
0
35
0
0
-1
0
5.37
"https://www.iha.com.tr/haber-mamut-art-project-7nci-yilinda-50-yeni-sanatciyi-agirliyor-764131/"
1
0
"US"
-86
0
0
0
0
0.175
0.825
0
0
0
0
0
2
27
1
0
0
45
0
0
-1
0
5.03
End of preview (truncated to 100 rows)

Important Notice:

  • A subset of the URL dataset is from Kaggle, and the Kaggle datasets contained 10%-15% mislabelled data. See this discussion I opened for some false positives. I have contacted Kaggle regarding their erroneous "Usability" score calculation for these unreliable datasets.
  • The feature extraction methods shown here are not robust at all in 2023, and there are even silly mistakes in 3 functions: not_indexed_by_google, domain_registration_length, and age_of_domain.

The features dataset is original, and my feature extraction method is covered in feature_extraction.py. To extract features from a website, simply pass the URL and label to collect_data(). The features are saved to phishing_detection_dataset.csv locally by default.
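To make the call pattern concrete, here is a minimal sketch of what calling collect_data() and appending to the default CSV might look like. The function body below is a hypothetical stand-in, not the code in feature_extraction.py: the csv_path parameter and the uses_https and url_length values are assumptions made for illustration, and the real function extracts the full feature set listed in the preview.

```python
import csv
import os
from urllib.parse import urlparse

# Default output file named in the text above.
DEFAULT_CSV = "phishing_detection_dataset.csv"


def collect_data(url, label, csv_path=DEFAULT_CSV):
    """Hypothetical stand-in for collect_data() in feature_extraction.py.

    The real function extracts the full feature set; this sketch only
    derives two illustrative values from the URL string itself.
    """
    parsed = urlparse(url)
    row = {
        "url": url,
        "uses_https": int(parsed.scheme == "https"),  # illustrative only
        "url_length": len(url),                       # illustrative only
        "is_malicious": label,
    }
    # Write the header only when the file is new or empty, then append the row.
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Calling the stand-in once per (URL, label) pair appends one row per call, so a labelled URL list can be processed in a simple loop.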

In the features dataset, there are 911,180 websites that were online at the time of data collection. The plots below show the regression line and correlation coefficient between each of the 22+ extracted features and whether the URL is malicious. If we could plot the lifespan of URLs, we could see that the oldest website has been online since Nov 7th, 2008, while the most recent phishing websites appeared as late as July 10th, 2023.

Malicious URL Categories

  • Defacement
  • Malware
  • Phishing

Data Analysis

Here are two images showing the correlation coefficient and coefficient of determination between predictor values and the target value is_malicious.

Correlation Coefficient

Coefficient of Determination
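As a rough illustration of the two quantities in these plots, the sketch below computes the Pearson correlation coefficient r between a feature column and is_malicious; for a single-predictor linear fit, the coefficient of determination is r squared. The column values are toy numbers chosen for illustration, not figures from the dataset, and the helper name pearson_r is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # constant column: correlation is undefined
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration, not taken from the dataset.
is_malicious = [0, 0, 1, 1, 0, 1]
features = {
    "redirects": [1, 0, 0, 1, 3, 0],
    "TXT_record": [0, 0, 0, 0, 0, 0],
}

for name, values in features.items():
    r = pearson_r(values, is_malicious)
    print(f"{name}: r={r:.2f}, r^2={r * r:.2f}")
```

A column that takes the same value for every website has zero variance, so its correlation with the label is undefined and it carries no signal for the classifier.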

Let's examine the correlations one by one and cross out any unreasonable or insignificant correlations.

| Variable | Justification for Crossing Out |
| --- | --- |
| redirects | contradicts previous research (as redirects increase, is_malicious tends to decrease by a little) |
| not_indexed_by_google | 0.00 correlation |
| email_submission | contradicts previous research |
| request_url_percentage | |
| issuer | |
| certificate_age | |
| url_anchor_percentage | contradicts previous research |
| meta_percentage | 0.00 correlation |
| script_percentage | |
| link_percentage | |
| mouseover_changes | contradicts previous research & 0.00 correlation |
| right_click_disabled | contradicts previous research & 0.00 correlation |
| popup_window_has_text_field | contradicts previous research |
| use_iframe | contradicts previous research |
| has_suspicious_port | contradicts previous research |
| external_favicons | contradicts previous research |
| TTL (Time to Live) | |
| ip_address_count | |
| TXT_record | all websites had a TXT record |
| check_sfh | contradicts previous research |
| count_domain_occurrences | |
| domain_registration_length | |
| abnormal_url | |
| age_of_domain | |
| page_rank_decimal | |

Pre-training Ideas

For training, I split the classification task into two stages in anticipation of the limited availability of online phishing websites due to their short lifespan, as well as the possibility that research done on phishing is not up-to-date:

  1. a small multilingual BERT model, fine-tuned on 2,436,727 legitimate and malicious URLs, that outputs the confidence level of a URL being malicious and passes it to model #2
  2. (probably) LightGBM to analyze the confidence level, along with roughly 10 extracted features

This way, I can make the most out of the limited phishing websites available.
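The data flow of this two-stage idea can be sketched as follows. Both models are replaced by trivial placeholders so the example runs without any ML libraries: stage1_url_confidence stands in for the BERT model, stage2_classify stands in for LightGBM, and the keywords, weights, and feature values are all hypothetical. Only the wiring (stage-1 confidence joined with extracted features as stage-2 input) reflects the plan above.

```python
import math


def stage1_url_confidence(url):
    """Placeholder for model #1 (the fine-tuned multilingual BERT).

    Returns a made-up confidence in [0, 1] based on two URL keywords;
    the real model would learn this from the URL corpus.
    """
    score = 0.1
    for token in ("login", "verif"):
        if token in url.lower():
            score += 0.4
    return min(score, 1.0)


def stage2_classify(confidence, features, weights, bias=0.0):
    """Placeholder for model #2 (LightGBM in the plan above).

    A hand-weighted logistic score stands in for gradient-boosted trees.
    The point is the input layout: stage-1 confidence plus extracted features.
    """
    inputs = {"url_confidence": confidence, **features}
    z = bias + sum(weights.get(name, 0.0) * value for name, value in inputs.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and feature values, chosen only for illustration.
weights = {"url_confidence": 4.0, "page_rank_decimal": -0.5}
features = {"page_rank_decimal": 2.77, "ip_address_count": 2}

url = "http://afex.biz/gmail_verificar/ServiceLoginAuth/fwd/"
confidence = stage1_url_confidence(url)
probability = stage2_classify(confidence, features, weights)
print(f"stage 1 confidence={confidence:.2f}, stage 2 probability={probability:.2f}")
```

Because stage 1 needs only the URL string, it can be trained on the full URL corpus, while stage 2 needs the smaller set of websites that were still online for feature extraction.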

Source of the URLs

Reference

Side notes


Models trained or fine-tuned on FredZhang7/malicious-website-features-2.4M