HTML data for the masses: data dump

I have been running regex searches on the HTML of the 8900 or so home pages, out of the top 10000, that I collected over Easter. I am providing the results of the searches I have conducted so far, in raw form:

Top 10000 web sites home pages HTML code data dump

Searches on the HTML of the 8900 sample pages were conducted on various HTML elements and attributes.
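As a rough illustration of this kind of search, here is a minimal Python sketch. The patterns, the `search_pages` helper and the sample pages are all hypothetical; the actual expressions used to produce the data dump are not given here. It looks for one element (`nav`) and two attributes (`placeholder`, `longdesc`) and records the matching start tags per page.

```python
import re

# Hypothetical patterns, for illustration only; these are not the
# expressions that produced the data dump.
PATTERNS = {
    "nav": re.compile(r"<nav\b[^>]*>", re.IGNORECASE),
    "placeholder": re.compile(r"<[^>]*\splaceholder\s*=[^>]*>", re.IGNORECASE),
    "longdesc": re.compile(r"<img\b[^>]*\slongdesc\s*=[^>]*>", re.IGNORECASE),
}


def search_pages(pages):
    """Return {pattern name: {url: [matching tags]}} for pages with hits.

    `pages` maps a URL to that page's HTML source as a string.
    """
    results = {name: {} for name in PATTERNS}
    for url, html in pages.items():
        for name, pattern in PATTERNS.items():
            matches = pattern.findall(html)
            if matches:
                results[name][url] = matches
    return results


# Two made-up pages standing in for the collected home pages.
sample = {
    "http://example.com/": '<nav class="main"><input type="text" placeholder="Search"></nav>',
    "http://example.org/": '<img src="chart.png" alt="Sales" longdesc="chart.html">',
}

hits = search_pages(sample)
for name, by_url in hits.items():
    print(name, "found on", len(by_url), "page(s)")
```

A regex is not an HTML parser, so a search like this can also match markup inside comments or scripts. That is one reason to treat raw results as a starting point for analysis rather than as final counts.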

NOTE: the resulting data output files are sometimes large and their HTML code is woeful; they are supplied as is. As time permits, I will analyse the data and clean up the HTML code.

data dump
element/attribute HTML file  file size  last modified date
address.html                 338 KB     11/04/2012
alt.html                     23573 KB   12/04/2012
aria.html                    2566 KB    11/04/2012
audio.html                   5 KB       10/04/2012
doctypeall-clean.zip         5 KB       11/04/2012
figure-figcaption.html       3034 KB    11/04/2012
footer.html                  1853 KB    10/04/2012
generator.html               1548 KB    10/04/2012
header.html                  2659 KB    11/04/2012
hgroup.html                  247 KB     10/04/2012
label-placeholder.htm        258 KB     12/04/2012
longdesc.html                2194 KB    10/04/2012
nav.html                     2194 KB    11/04/2012
placeholder-title.html       467 KB     12/04/2012
placeholder.html             1489 KB    12/04/2012
section.html                 4202 KB    10/04/2012
summaryattribute.html        1068 KB    12/04/2012
tabindex.html                6848 KB    12/04/2012
th.html                      5557 KB    12/04/2012
u.html                       2363 KB    10/04/2012
video.html                   143 KB     10/04/2012
top10000URL1.txt             330 KB     11/04/2012
top10000URL2.txt             79 KB      09/04/2012


About Steve Faulkner

Steve was the Chief Accessibility Officer at TPGi before he left in October 2023. He joined TPGi in 2006 and was previously a Senior Web Accessibility Consultant at Vision Australia. Steve is a member of several groups, including the W3C Web Platforms Working Group and the W3C ARIA Working Group. He is an editor of several specifications at the W3C, including ARIA in HTML and HTML Accessibility API Mappings 1.0. He also develops and maintains HTML5accessibility and the JAWS bug tracker/standards support.