{"id":8931,"date":"2022-05-10T10:15:26","date_gmt":"2022-05-10T15:15:26","guid":{"rendered":"https:\/\/www.rushworth.us\/lisa\/?p=8931"},"modified":"2022-05-10T10:15:41","modified_gmt":"2022-05-10T15:15:41","slug":"elasticsearch-analyzer","status":"publish","type":"post","link":"https:\/\/www.rushworth.us\/lisa\/?p=8931","title":{"rendered":"ElasticSearch Analyzer"},"content":{"rendered":"<h2>Analyzer Components<\/h2>\n<p><a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/analysis-charfilters.html\">Character filters<\/a> are the first component of an analyzer. They can remove unwanted characters \u2013 this could be HTML tags (\u201cchar_filter\u201d: [\u201chtml_strip\u201d]) or some custom replacement \u2013 or change character(s) into other character(s). Output from the character filter is passed to the tokenizer.<\/p>\n<p>The <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/analysis-tokenizers.html\">tokenizer<\/a> breaks the string out into individual components (tokens). A commonly used tokenizer is the whitespace tokenizer, which uses whitespace characters as the token delimiter. For CSV data, you could build a custom pattern tokenizer with \u201c,\u201d as the delimiter.<\/p>\n<p>Then <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/analysis-tokenfilters.html\">token filters<\/a> add, change, or remove tokens, dropping anything deemed unnecessary. 
The standard analyzer applies a lowercase token filter too \u2013 so NOW, Now, and now all produce the same token.<\/p>\n<h2>Testing an Analyzer<\/h2>\n<p>You can analyze a one-off string with a given analyzer by calling the \/_analyze endpoint:<\/p>\n<p>curl -u &quot;admin:admin&quot; -k -X GET https:\/\/localhost:9200\/_analyze --header &#039;Content-Type: application\/json&#039; --data &#039;{<\/p>\n<p>&quot;analyzer&quot;: &quot;standard&quot;,<\/p>\n<p>&quot;text&quot;: &quot;THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG&#039;\\&#039;&#039;S BACK 1234567890&quot;<\/p>\n<p>}&#039;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1199\" height=\"212\" class=\"wp-image-8932\" src=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-3.png\" srcset=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-3.png 1199w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-3-300x53.png 300w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-3-1024x181.png 1024w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-3-768x136.png 768w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-3-750x133.png 750w\" sizes=\"auto, (max-width: 1199px) 100vw, 1199px\" \/><\/p>\n<p>Specifying different analyzers produces different tokens.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1193\" height=\"242\" class=\"wp-image-8933\" src=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-4.png\" srcset=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-4.png 1193w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-4-300x61.png 300w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-4-1024x208.png 1024w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-4-768x156.png 768w, 
https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-4-750x152.png 750w\" sizes=\"auto, (max-width: 1193px) 100vw, 1193px\" \/><\/p>\n<p>You can also define a custom analyzer in an index \u2013 you\u2019ll see it in the index configuration. Adding character mappings to a custom character filter \u2013 <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/analysis-mapping-charfilter.html\">the example used in Elastic\u2019s documentation maps Hindu-Arabic numerals to their Arabic-Latin equivalents<\/a> \u2013 might be a useful tool in our implementation. One of the examples turns ASCII emoticons into emotional descriptors (_happy_, _sad_, _crying_, _raspberry_, etc.), which would be useful in analyzing customer communications. In log processing, we might want to map phrases to commonly used abbreviations (not a real-world example, but if programmatic input spelled out \u201cself-contained breathing apparatus\u201d, I expect most people would still search for SCBA if they wanted to see how frequently SCBA tanks were used for call-outs). 
It will be interesting to see how often programmatic input fails to line up with user expectations, which will tell us whether character mappings are beneficial.<\/p>\n<p>In addition to testing individual analyzers, you can test the analyzer associated with an index \u2013 instead of using the \/_analyze endpoint, use the \/indexname\/_analyze endpoint.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1199\" height=\"181\" class=\"wp-image-8934\" src=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-5.png\" srcset=\"https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-5.png 1199w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-5-300x45.png 300w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-5-1024x155.png 1024w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-5-768x116.png 768w, https:\/\/www.rushworth.us\/lisa\/wp-content\/uploads\/2022\/05\/word-image-5-750x113.png 750w\" sizes=\"auto, (max-width: 1199px) 100vw, 1199px\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Analyzer Components Character filters are the first component of an analyzer. They can remove unwanted characters \u2013 this could be HTML tags (\u201cchar_filter\u201d: [\u201chtml_strip\u201d]) or some custom replacement \u2013 or change character(s) into other character(s). Output from the character filter is passed to the tokenizer. The tokenizer breaks the string out into individual components (tokens). 
&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1588],"tags":[1590,1589],"class_list":["post-8931","post","type-post","status-publish","format-standard","hentry","category-elk","tag-elasticsearch","tag-elk"],"_links":{"self":[{"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/posts\/8931","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8931"}],"version-history":[{"count":1,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/posts\/8931\/revisions"}],"predecessor-version":[{"id":8935,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=\/wp\/v2\/posts\/8931\/revisions\/8935"}],"wp:attachment":[{"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8931"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8931"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rushworth.us\/lisa\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8931"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}