{"id":11370,"date":"2024-11-18T18:56:00","date_gmt":"2024-11-18T23:56:00","guid":{"rendered":"https:\/\/www.rushworth.us\/lisa\/?p=11370"},"modified":"2025-01-16T13:08:56","modified_gmt":"2025-01-16T18:08:56","slug":"javascript-extracting-web-content-you-cannot-copy","status":"publish","type":"post","link":"https:\/\/www.rushworth.us\/lisa\/?p=11370","title":{"rendered":"JavaScript: Extracting Web Content You Cannot Copy"},"content":{"rendered":"\n<p>I often need to copy &#8220;stuff&#8221; from a website that is structured in such a way that simply copying and pasting the table data is impossible. Screen prints work, but I usually want the table of data <em>in Excel<\/em> so I can add notations and such. In these cases, running JavaScript from the browser&#8217;s developer console lets you access the underlying text elements.<\/p>\n\n\n\n<p>Right-click on one of the text elements and select &#8220;Inspect&#8221;.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-Inxpect.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"441\" height=\"505\" src=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-Inxpect.jpg\" alt=\"\" class=\"wp-image-11371\" srcset=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-Inxpect.jpg 441w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-Inxpect-262x300.jpg 262w\" sizes=\"auto, (max-width: 441px) 100vw, 441px\" \/><\/a><\/figure>\n\n\n\n<p>Now copy the element&#8217;s XPath.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-InspectXPath.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"731\" height=\"245\" src=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-InspectXPath.jpg\" alt=\"\" class=\"wp-image-11372\" 
srcset=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-InspectXPath.jpg 731w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-InspectXPath-300x101.jpg 300w\" sizes=\"auto, (max-width: 731px) 100vw, 731px\" \/><\/a><\/figure>\n\n\n\n<p>Read through the copied path. We generally don&#8217;t want <em>just<\/em> this one element, but the path down to the &#8220;tbody&#8221; tag looks like a reasonable place to find all of the values within the table.<\/p>\n\n\n\n<p>\/html\/body\/div[1]\/div\/div\/div[2]\/div[2]\/div[2]\/div\/div[3]\/div\/div\/div[3]\/div\/div\/div\/table\/tbody\/a[4]\/td[2]\/div\/span[2]<\/p>\n\n\n\n<p>Truncate the path at the tbody, append &#8220;\/\/td&#8221; to it, and use JavaScript in the console to grab all of the TD elements under the tbody:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: jscript; title: ; notranslate\" title=\"\">\n\/\/ Define the XPath expression to select all &lt;td&gt; elements within the specific &lt;tbody&gt;\nconst xpathExpression = &quot;\/html\/body\/div&#x5B;1]\/div\/div\/div&#x5B;2]\/div&#x5B;2]\/div&#x5B;2]\/div\/div&#x5B;3]\/div\/div\/div&#x5B;3]\/div\/div\/div\/table\/tbody\/\/td&quot;;\n\n\/\/ Use document.evaluate to get all matching &lt;td&gt; nodes\nconst nodesSnapshot = document.evaluate(xpathExpression, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);\n\n\/\/ Log the number of nodes found (for debugging purposes)\nconsole.log(&quot;Total &lt;td&gt; elements found:&quot;, nodesSnapshot.snapshotLength);\n\n\/\/ Iterate over the nodes and log their text content\nfor (let i = 0; i &lt; nodesSnapshot.snapshotLength; i++) {\n    const node = nodesSnapshot.snapshotItem(i);\n    if (node) {\n        const textContent = node.textContent.trim();\n        if (textContent) { \/\/ Only log non-empty content\n            console.log(textContent);\n        }\n    }\n}\n<\/pre><\/div>\n\n\n<p>Voila! I redacted some data below, but the output is just a list of values, one per line. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-ExractedText.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"770\" height=\"423\" src=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-ExractedText.jpg\" alt=\"\" class=\"wp-image-11373\" srcset=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-ExractedText.jpg 770w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-ExractedText-300x165.jpg 300w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-ExractedText-768x422.jpg 768w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2025\/01\/DevConsole-ExractedText-750x412.jpg 750w\" sizes=\"auto, (max-width: 770px) 100vw, 770px\" \/><\/a><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>There are many times I need to copy &#8220;stuff&#8221; from a website that is structured in such a way that simply copy\/pasting the table data is impossible. Screen prints work, but I usually want the table of data in Excel so I can add notations and such. 
In these cases, running JavaScript from the browser&#8217;s &hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[33,29],"tags":[1278,2081,2080,2079,1279,856],"class_list":["post-11370","post","type-post","status-publish","format-standard","hentry","category-coding","category-technology","tag-chrome","tag-data-extraction","tag-developer-tools","tag-edge","tag-firefox","tag-javascript"],"_links":{"self":[{"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/posts\/11370","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=11370"}],"version-history":[{"count":1,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/posts\/11370\/revisions"}],"predecessor-version":[{"id":11374,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/posts\/11370\/revisions\/11374"}],"wp:attachment":[{"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=11370"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=11370"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=11370"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}