From 13fd1c2dcd68781f38ffc416ea7f3823b639b2d0 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Fri, 4 Mar 2016 18:40:38 +0800 Subject: [PATCH 01/95] chapter46_part4: /510_Deployment/40_config.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 未完成 --- 510_Deployment/40_config.asciidoc | 138 +++++++++++++----------------- 1 file changed, 58 insertions(+), 80 deletions(-) diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index 94ea5404b..b0f29eeb8 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -1,53 +1,45 @@ -[[important-configuration-changes]] -=== Important Configuration Changes -Elasticsearch ships with _very good_ defaults,((("deployment", "configuration changes, important")))((("configuration changes, important"))) especially when it comes to performance- -related settings and options. When in doubt, just leave -the settings alone. We have witnessed countless dozens of clusters ruined -by errant settings because the administrator thought he could turn a knob -and gain 100-fold improvement. +[[重要配置的修改]] + +=== 重要配置的修改 +Elasticsearch 已经有了 _很好_ 的默认值,((("deployment", "configuration changes, important")))((("configuration changes, important")))特别是涉及到性能相关的配置或者选项。 +如果你有疑问,最好就不要动它。我们已经目睹了数十个因为错误的设置而导致毁灭的集群, +因为它的管理者总认为改动一个配置或者选项就可以带来100倍的提升。 [NOTE] ==== -Please read this entire section! All configurations presented are equally -important, and are not listed in any particular order. Please read -through all configuration options and apply them to your cluster. +请阅读整节文章,所有的配置项都同等重要,和描述顺序无关,请阅读所有的配置选项,并应用到你的集群中。 ==== -Other databases may require tuning, but by and large, Elasticsearch does not. -If you are hitting performance problems, the solution is usually better data -layout or more nodes. There are very few "magic knobs" in Elasticsearch. -If there were, we'd have turned them already! +其它数据库可能需要调优,但总得来说,Elasticsearch 不需要。 +如果你遇到了性能问题,最好的解决方法通常是更好的数据布局或者更多的节点。 +在 Elasticsearch 中很少有“神奇的配置项”, +如果存在,我们也已经帮你优化了。 -With that said, there are some _logistical_ configurations that should be changed -for production. These changes are necessary either to make your life easier, or because -there is no way to set a good default (because it depends on your cluster layout). +说到这里,有一些保障性的配置需要在生产环境中做修改。 +这些改动是必须的,因为没有办法设定好的默认值(它取决于你的集群布局)。 -==== Assign Names +==== 指定名字 -Elasticseach by default starts a cluster named `elasticsearch`. ((("configuration changes, important", "assigning names"))) It is wise -to rename your production cluster to something else, simply to prevent accidents -whereby someone's laptop joins the cluster. A simple change to `elasticsearch_production` -can save a lot of heartache. +Elasticsearch 默认启动的集群名字叫 `elasticsearch`。((("configuration changes, important", "assigning names")))你最好 +给你的生产环境的集群改个名字,改名字的目的很简单, +就是防止某个人的笔记本加入到了集群。简单修改成 `elasticsearch_production` 会很省心。 -This can be changed in your `elasticsearch.yml` file: +你可以在你的 `elasticsearch.yml` 文件中: [source,yaml] ---- cluster.name: elasticsearch_production ---- -Similarly, it is wise to change the names of your nodes. As you've probably -noticed by now, Elasticsearch assigns a random Marvel superhero name -to your nodes at startup. This is cute in development--but less cute when it is -3a.m. and you are trying to remember which physical machine was Tagak the Leopard Lord. 
+同样,最好也修改你的节点名字。就像你现在可能发现的那样, +Elasticsearch 会在你的节点启动的时候随机给它指定一个名字。你可能会觉得这很有趣,但是当凌晨3点钟的时候, +你还在尝试回忆哪台物理机是 `Tagak the Leopard Lord` 的时候,你就不觉得有趣了。 -More important, since these names are generated on startup, each time you -restart your node, it will get a new name. This can make logs confusing, -since the names of all the nodes are constantly changing. +更重要的是,这些名字是在启动的时候产生的,每次启动节点, +它都会得到一个新的名字。这会使日志变得很混乱,因为所有节点的名称都是不断变化的。 -Boring as it might be, we recommend you give each node a name that makes sense -to you--a plain, descriptive name. This is also configured in your `elasticsearch.yml`: +这可能会让你觉得厌烦,我们建议给每个节点设置一个有意义的、清楚的、描述性的名字,同样你可以在 `elasticsearch.yml` 中配置: [source,yaml] ---- @@ -55,19 +47,17 @@ node.name: elasticsearch_005_data ---- -==== Paths +==== 路径 -By default, Elasticsearch will place the plug-ins,((("configuration changes, important", "paths"))) -((("paths"))) logs, and--most important--your data in the installation directory. This can lead to -unfortunate accidents, whereby the installation directory is accidentally overwritten -by a new installation of Elasticsearch. If you aren't careful, you can erase all your data. +默认情况下,((("configuration changes, important", "paths")))((("paths")))Elasticsearch 会把插件、日志以及你最重要的数据放在安装目录下。这会带来不幸的事故, +如果你重新安装 Elasticsearch 的时候不小心把安装目录覆盖了。如果你不小心,你就可能把你的全部数据删掉了。 -Don't laugh--we've seen it happen more than a few times. +不要笑,这种情况,我们见过很多次了。 -The best thing to do is relocate your data directory outside the installation -location. You can optionally move your plug-in and log directories as well. +最好的选择就是把你的数据目录配置到安装目录以外的地方, +同样你也可以选择转移你的插件和日志目录。 -This can be changed as follows: +可以更改如下: [source,yaml] ---- @@ -79,12 +69,10 @@ path.logs: /path/to/logs # Path to where plugins are installed: path.plugins: /path/to/plugins ---- -<1> Notice that you can specify more than one directory for data by using comma-separated lists. +<1> 注意:你可以通过逗号分隔指定多个目录。 -Data can be saved to multiple directories, and if each directory -is mounted on a different hard drive, this is a simple and effective way to -set up a software RAID 0. Elasticsearch will automatically stripe -data between the different directories, boosting performance. +数据可以保存到多个不同的目录, +每个目录如果是挂载在不同的硬盘,做 RAID 0 是一个简单而有效的办法。Elasticsearch 会自动把数据分隔到不同的目录,以便提高性能。 .Multiple data path safety and performance [WARNING] @@ -110,46 +98,39 @@ robustness and flexibility, we encourage you to use actual software RAID package instead of the multiple data path feature. ==================== -==== Minimum Master Nodes +==== 最小主节点数 -The `minimum_master_nodes` setting is _extremely_ important to the -stability of your cluster.((("configuration changes, important", "minimum_master_nodes setting")))((("minimum_master_nodes setting"))) This setting helps prevent _split brains_, the existence of two masters in a single cluster. +`minimum_master_nodes` 设定对你的集群的稳定 _及其_ 重要。 +((("configuration changes, important", "minimum_master_nodes setting")))((("minimum_master_nodes setting"))) +当你的集群中有两个 masters(注:主节点)的时候,这个配置有助于防止集群分裂(注:脑裂)。 -When you have a split brain, your cluster is at danger of losing data. Because -the master is considered the supreme ruler of the cluster, it decides -when new indices can be created, how shards are moved, and so forth. If you have _two_ -masters, data integrity becomes perilous, since you have two nodes -that think they are in charge. 
+如果你的集群发生了一个脑裂,那么你的集群就会处在丢失数据的危险中,因为 +节点是被认为是这个集群的最高统治者,它决定了什么时候新的索引可以创建,多少分片要移动等等。如果你有 _两个_ masters 节点, +你的数据的完整性将得不到保证,因为你有两个节点认为他们有集群的控制权。 -This setting tells Elasticsearch to not elect a master unless there are enough -master-eligible nodes available. Only then will an election take place. +这个配置就是告诉 Elasticsearch 当没有足够 master 候选节点的时候,就不要进行 master 节点选举,等 master 候选节点足够了才进行选举。 -This setting should always be configured to a quorum (majority) of your master-eligible nodes.((("quorum"))) A quorum is `(number of master-eligible nodes / 2) + 1`. -Here are some examples: +此设置应该始终被配置为 master 候选节点的法定个数(大多数个)。((("quorum")))法定个数就是 `( master 候选节点个数 / 2) + 1`。 +这里有几个例子: -- If you have ten regular nodes (can hold data, can become master), a quorum is -`6`. -- If you have three dedicated master nodes and a hundred data nodes, the quorum is `2`, -since you need to count only nodes that are master eligible. -- If you have two regular nodes, you are in a conundrum. A quorum would be -`2`, but this means a loss of one node will make your cluster inoperable. A -setting of `1` will allow your cluster to function, but doesn't protect against -split brain. It is best to have a minimum of three nodes in situations like this. +- 如果你有10个节点(能保存数据,同时能成为 master),法定数就是 `6`。 +- 如果你有3个候选 master 节点,和100个 date 节点,法定数就是 `2`,你只要数数那些可以做 master 的节点数就可以了。 +- 如果你有两个节点,你遇到难题了。法定数当然是 `2`,但是这意味着如果有一个节点挂掉,你整个集群就不可用了。 +设置成 `1` 可以保证集群的功能,但是就无法保证集群脑裂了,像这样的情况,你最好至少保证有3个节点。 -This setting can be configured in your `elasticsearch.yml` file: +你可以在你的 `elasticsearch.yml` 文件中这样配置: [source,yaml] ---- discovery.zen.minimum_master_nodes: 2 ---- -But because Elasticsearch clusters are dynamic, you could easily add or remove -nodes that will change the quorum. It would be extremely irritating if you had -to push new configurations to each node and restart your whole cluster just to -change the setting. +但是由于 ELasticsearch 是动态的,你可以很容易的添加和删除节点, +但是这会改变这个法定个数。 +你不得不修改每一个索引节点的配置并且重启你的整个集群只是为了让配置生效,这将是非常痛苦的一件事情。 -For this reason, `minimum_master_nodes` (and other settings) can be configured -via a dynamic API call. You can change the setting while your cluster is online: +基于这个原因,`minimum_master_nodes`(还有一些其它配置)允许通过 API 调用的方式动态进行配置。 +当你的集群在线运行的时候,你可以这样修改配置: [source,js] ---- @@ -161,15 +142,12 @@ PUT /_cluster/settings } ---- -This will become a persistent setting that takes precedence over whatever is -in the static configuration. You should modify this setting whenever you add or -remove master-eligible nodes. +这将成为一个永久的配置,并且无论你配置项里配置的如何,这个将优先生效。当你添加和删除master节点的时候,你需要更改这个配置。 -==== Recovery Settings +==== 集群恢复方面的配置 -Several settings affect the behavior of shard recovery when -your cluster restarts.((("recovery settings")))((("configuration changes, important", "recovery settings"))) First, we need to understand what happens if nothing is -configured. +当你集群重启时,几个配置项影响你的分片恢复的表现。((("recovery settings")))((("configuration changes, important", "recovery settings")))首先,我们需要明白 +如果什么也没配置将会发生什么。 Imagine you have ten nodes, and each node holds a single shard--either a primary or a replica--in a 5 primary / 1 replica index. 
You take your From 3e8c67347123ae04c3f80a462e414d667b3ee8a6 Mon Sep 17 00:00:00 2001 From: Andreas Baakind Date: Wed, 25 Nov 2015 14:33:44 +0100 Subject: [PATCH 02/95] Fixed typo Fixed typo: clsuter -> cluster --- 520_Post_Deployment/60_restore.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/520_Post_Deployment/60_restore.asciidoc b/520_Post_Deployment/60_restore.asciidoc index d4f383eb6..f0ef88b6d 100644 --- a/520_Post_Deployment/60_restore.asciidoc +++ b/520_Post_Deployment/60_restore.asciidoc @@ -78,7 +78,7 @@ GET /_recovery/ ---- The output will look similar to this (and note, it can become very verbose -depending on the activity of your clsuter!): +depending on the activity of your cluster!): [source,js] ---- From 79d39237ef6626acace7139254abb401b8806370 Mon Sep 17 00:00:00 2001 From: Julian Simioni Date: Fri, 11 Dec 2015 11:00:54 -0500 Subject: [PATCH 03/95] Add missing closing parenthesis --- 130_Partial_Matching/05_Postcodes.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/130_Partial_Matching/05_Postcodes.asciidoc b/130_Partial_Matching/05_Postcodes.asciidoc index 5aea8d103..d3a47a907 100644 --- a/130_Partial_Matching/05_Postcodes.asciidoc +++ b/130_Partial_Matching/05_Postcodes.asciidoc @@ -7,7 +7,7 @@ postcode `W1V 3DG` can((("postcodes (UK), partial matching with"))) be broken do * `W1V`: This outer part identifies the postal area and district: ** `W` indicates the area (one or two letters) -** `1V` indicates the district (one or two numbers, possibly followed by a letter +** `1V` indicates the district (one or two numbers, possibly followed by a letter) * `3DG`: This inner part identifies a street or building: From e1930833840a83cbf30589fc96b60e851bbd0931 Mon Sep 17 00:00:00 2001 From: Jared Carey Date: Tue, 2 Feb 2016 16:14:03 -0700 Subject: [PATCH 04/95] typo --- 402_Nested/30_Nested_objects.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/402_Nested/30_Nested_objects.asciidoc b/402_Nested/30_Nested_objects.asciidoc index fc90de870..be4392d8b 100644 --- a/402_Nested/30_Nested_objects.asciidoc +++ b/402_Nested/30_Nested_objects.asciidoc @@ -79,7 +79,7 @@ The correlation between `Alice` and `31`, or between `John` and `2014-09-01`, ha from a search point of view, for storing an array of objects. This is the problem that _nested objects_ are designed to solve. 
By mapping -the `commments` field as type `nested` instead of type `object`, each nested +the `comments` field as type `nested` instead of type `object`, each nested object is indexed as a _hidden separate document_, something like this: [source,json] From 7e728f4d67bb5f508f4032b6dd09178a871cf3e2 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Wed, 2 Mar 2016 17:55:32 +0800 Subject: [PATCH 05/95] =?UTF-8?q?510=5FDeployment=5F20=5Fhardware=E7=9A=84?= =?UTF-8?q?=E7=BF=BB=E8=AF=91?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 510_Deployment_20_hardware的翻译 --- 510_Deployment/20_hardware.asciidoc | 142 ++++++++++------------------ 1 file changed, 49 insertions(+), 93 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index acc588466..98173d0f5 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -1,117 +1,73 @@ [[hardware]] -=== Hardware - -If you've been following the normal development path, you've probably been playing((("deployment", "hardware")))((("hardware"))) -with Elasticsearch on your laptop or on a small cluster of machines laying around. -But when it comes time to deploy Elasticsearch to production, there are a few -recommendations that you should consider. Nothing is a hard-and-fast rule; -Elasticsearch is used for a wide range of tasks and on a bewildering array of -machines. But these recommendations provide good starting points based on our experience with -production clusters. - -==== Memory - -If there is one resource that you will run out of first, it will likely be memory.((("hardware", "memory")))((("memory"))) -Sorting and aggregations can both be memory hungry, so enough heap space to -accommodate these is important.((("heap"))) Even when the heap is comparatively small, -extra memory can be given to the OS filesystem cache. Because many data structures -used by Lucene are disk-based formats, Elasticsearch leverages the OS cache to -great effect. - -A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB machines -are also common. Less than 8 GB tends to be counterproductive (you end up -needing many, many small machines), and greater than 64 GB has problems that we will -discuss in <>. +=== 硬件 + +按照正常的流程,你可能已经((("deployment", "hardware")))((("hardware")))在自己的笔记本电脑或集群上使用了 Elasticsearch。 +但是当要部署 Elasticsearch 到生产环境时,有一些建议是你需要考虑的。这里没有什么必须要遵守的准则,Elasticsearch 被用于在众多的机器上处理各种任务。基于我们在生产环境使用 Elasticsearch 集群的经验,可以为你提供一个好的起点。 + +==== 内存 + +如果有一种资源是最先被耗尽的,它可能是内存。((("hardware", "memory")))((("memory"))) +排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。((("heap"))) 即使堆空间是比较小的时候, +也能为操作系统文件缓存提供额外的内存。因为Lucene使用的许多数据结构是基于磁盘的格式,Elasticsearch 利用操作系统缓存能产生很大效果。 + +64 GB内存的机器是非常理想的, 但是32 GB 和 16 GB 机器也是很常见的。少于8 GB 会适得其反 (你最终需要很多很多的小机器), 大于64 GB的机器也会有问题, +我们将在<>中讨论。 ==== CPUs -Most Elasticsearch deployments tend to be rather light on CPU requirements. As -such,((("CPUs (central processing units)")))((("hardware", "CPUs"))) the exact processor setup matters less than the other resources. You should -choose a modern processor with multiple cores. Common clusters utilize two to eight -core machines. +大多数 Elasticsearch 部署往往对CPU要求很小。因此,((("CPUs (central processing units)")))((("hardware", "CPUs"))) +确切的处理器安装事项少于其他资源。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 -If you need to choose between faster CPUs or more cores, choose more cores. The -extra concurrency that multiple cores offers will far outweigh a slightly faster -clock speed. 
+如果你要在更快的CUPs和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发将远远超过稍微快点的CPU速度。 -==== Disks +==== 硬盘 -Disks are important for all clusters,((("disks")))((("hardware", "disks"))) and doubly so for indexing-heavy clusters -(such as those that ingest log data). Disks are the slowest subsystem in a server, -which means that write-heavy clusters can easily saturate their disks, which in -turn become the bottleneck of the cluster. +硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks"))) 对高度索引的集群更是加倍重要 +(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,,使得它成为集群的瓶颈。 -If you can afford SSDs, they are by far superior to any spinning media. SSD-backed -nodes see boosts in both query and indexing performance. If you can afford it, -SSDs are the way to go. +如果你负担得起SSD,它将远远超出任何旋转媒介(注:机械硬盘,磁带等)。 基于SSD 的节点,查询和索引性能都有提升。如果你负担得起,SSD是一个好的选择。 -.Check Your I/O Scheduler -**** -If you are using SSDs, make sure your OS I/O scheduler is((("I/O scheduler"))) configured correctly. -When you write data to disk, the I/O scheduler decides when that data is -_actually_ sent to the disk. The default under most *nix distributions is a -scheduler called `cfq` (Completely Fair Queuing). - -This scheduler allocates _time slices_ to each process, and then optimizes the -delivery of these various queues to the disk. It is optimized for spinning media: -the nature of rotating platters means it is more efficient to write data to disk -based on physical layout. - -This is inefficient for SSD, however, since there are no spinning platters -involved. Instead, `deadline` or `noop` should be used instead. The deadline -scheduler optimizes based on how long writes have been pending, while `noop` -is just a simple FIFO queue. - -This simple change can have dramatic impacts. We've seen a 500-fold improvement -to write throughput just by using the correct scheduler. +.检查你的 I/O 调度程序 **** +如果你正在使用SSDs,确保你的系统 I/O 调度程序是((("I/O scheduler"))) 配置正确的。 +当你向硬盘写数据,I/O 调度程序决定何时把数据 +_实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq` (完全公平队列). + +调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转媒介优化的: +旋转盘片的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 -If you use spinning media, try to obtain the fastest disks possible (high-performance server disks, 15k RPM drives). +这对SSD来说是低效的,然而,尽管这里没有涉及到旋转盘片。但是,`deadline` 或者 `noop` 应该被使用。`deadline`调度程序基于写入等待时间进行优化, +`noop`只是一个简单的FIFO队列。 -Using RAID 0 is an effective way to increase disk speed, for both spinning disks -and SSD. There is no need to use mirroring or parity variants of RAID, since -high availability is built into Elasticsearch via replicas. +这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 +**** -Finally, avoid network-attached storage (NAS). People routinely claim their -NAS solution is faster and more reliable than local drives. Despite these claims, -we have never seen NAS live up to its hype. NAS is often slower, displays -larger latencies with a wider deviation in average latency, and is a single -point of failure. +如果你使用旋转媒介,尝试获取尽可能快的硬盘 (高性能服务器硬盘, 15k RPM 驱动器). -==== Network +使用RAID 0是提高硬盘速度的有效途径, 对旋转硬盘和SSD来说都是如此。没有必要使用镜像或其它RAID变体,因为高可用已经通过replicas内建于Elasticsearch之中。 -A fast and reliable network is obviously important to performance in a distributed((("hardware", "network")))((("network"))) -system. Low latency helps ensure that nodes can communicate easily, while -high bandwidth helps shard movement and recovery. Modern data-center networking -(1 GbE, 10 GbE) is sufficient for the vast majority of clusters. 
+最后,避免使用网络附加存储 (NAS)。人们常声称他们的NAS解决方案比本地驱动器更快更可靠。除却这些声称, +我们从没看到NAS能配得上它的大肆宣传。NAS常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。 -Avoid clusters that span multiple data centers, even if the data centers are -colocated in close proximity. Definitely avoid clusters that span large geographic -distances. +==== 网络 -Elasticsearch clusters assume that all nodes are equal--not that half the nodes -are actually 150ms distant in another data center. Larger latencies tend to -exacerbate problems in distributed systems and make debugging and resolution -more difficult. +快速可靠的网络显然对分布式系统的性能是很重要的((("hardware", "network")))((("network")))。 +低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络 +(1 GbE, 10 GbE) 对绝大多数集群都是足够的。 -Similar to the NAS argument, everyone claims that their pipe between data centers is -robust and low latency. This is true--until it isn't (a network failure will -happen eventually; you can count on it). From our experience, the hassle of -managing cross–data center clusters is simply not worth the cost. +即使数据中心们近在咫尺,也要避免集群跨越多个数据中心。绝对要避免集群跨越大的地理距离。 -==== General Considerations +Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节点在150ms外的另一数据中心而有所不同。更大的延时会加重分布式系统中的问题而且使得调试和排错更困难。 -It is possible nowadays to obtain truly enormous machines:((("hardware", "general considerations"))) hundreds of gigabytes -of RAM with dozens of CPU cores. Conversely, it is also possible to spin up -thousands of small virtual machines in cloud platforms such as EC2. Which -approach is best? +和NAS的争论类似,每个人都声称他们的数据中心间的线路都是健壮和低延时的。这是真的--直到它不是时(网络失败终究是会发生的,你可以相信它)。 +从我们的经验来看,处理跨数据中心集群的麻烦事是根本不值得的。 -In general, it is better to prefer medium-to-large boxes. Avoid small machines, -because you don't want to manage a cluster with a thousand nodes, and the overhead -of simply running Elasticsearch is more apparent on such small boxes. +==== 一般注意事项 -At the same time, avoid the truly enormous machines. They often lead to imbalanced -resource usage (for example, all the memory is being used, but none of the CPU) and can -add logistical complexity if you have to run multiple nodes per machine. +获取真正的巨型机器在今天是可能的:((("hardware", "general considerations"))) 成百GB的RAM 和 几十个 CPU 核心。 +反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪条道路是最好的? 
+通常,选择中到大型机器更好。避免使用小型机器, +因为你不会希望去管理拥有上千个节点的集群,而且在这些小型机器上 运行Elasticsearch的开销也是显著的。 +与此同时,避免使用真正的巨型机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但CPU没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 From 277cddadc50bd2df6fff1bf146f2252074ddbe85 Mon Sep 17 00:00:00 2001 From: "pengqiuyuanfj@gmail.com" Date: Wed, 2 Mar 2016 22:31:39 +0800 Subject: [PATCH 06/95] chapter7_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 修改中英文中间需要空格 --- 510_Deployment/20_hardware.asciidoc | 49 ++++++++++++++--------------- 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index 98173d0f5..312273670 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -1,73 +1,72 @@ [[hardware]] === 硬件 -按照正常的流程,你可能已经((("deployment", "hardware")))((("hardware")))在自己的笔记本电脑或集群上使用了 Elasticsearch。 +按照正常的流程,你可能已经 ((("deployment", "hardware")))((("hardware"))) 在自己的笔记本电脑或集群上使用了 Elasticsearch 。 但是当要部署 Elasticsearch 到生产环境时,有一些建议是你需要考虑的。这里没有什么必须要遵守的准则,Elasticsearch 被用于在众多的机器上处理各种任务。基于我们在生产环境使用 Elasticsearch 集群的经验,可以为你提供一个好的起点。 ==== 内存 -如果有一种资源是最先被耗尽的,它可能是内存。((("hardware", "memory")))((("memory"))) -排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。((("heap"))) 即使堆空间是比较小的时候, -也能为操作系统文件缓存提供额外的内存。因为Lucene使用的许多数据结构是基于磁盘的格式,Elasticsearch 利用操作系统缓存能产生很大效果。 +如果有一种资源是最先被耗尽的,它可能是内存。 ((("hardware", "memory")))((("memory"))) +排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。 ((("heap"))) 即使堆空间是比较小的时候, +也能为操作系统文件缓存提供额外的内存。因为 Lucene 使用的许多数据结构是基于磁盘的格式, Elasticsearch 利用操作系统缓存能产生很大效果。 -64 GB内存的机器是非常理想的, 但是32 GB 和 16 GB 机器也是很常见的。少于8 GB 会适得其反 (你最终需要很多很多的小机器), 大于64 GB的机器也会有问题, -我们将在<>中讨论。 +64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反 (你最终需要很多很多的小机器), 大于64 GB的机器也会有问题, +我们将在 <> 中讨论。 ==== CPUs -大多数 Elasticsearch 部署往往对CPU要求很小。因此,((("CPUs (central processing units)")))((("hardware", "CPUs"))) +大多数 Elasticsearch 部署往往对CPU要求很小。因此, ((("CPUs (central processing units)")))((("hardware", "CPUs"))) 确切的处理器安装事项少于其他资源。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 -如果你要在更快的CUPs和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发将远远超过稍微快点的CPU速度。 +如果你要在更快的CPUs和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发远胜过稍微快一点点的时钟频率。 ==== 硬盘 -硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks"))) 对高度索引的集群更是加倍重要 -(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,,使得它成为集群的瓶颈。 +硬盘对所有的集群都很重要, ((("disks")))((("hardware", "disks"))) 对高度索引的集群更是加倍重要 +(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 -如果你负担得起SSD,它将远远超出任何旋转媒介(注:机械硬盘,磁带等)。 基于SSD 的节点,查询和索引性能都有提升。如果你负担得起,SSD是一个好的选择。 +如果你负担得起 SSD ,它将远远超出任何旋转媒介(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起, SSD 是一个好的选择。 .检查你的 I/O 调度程序 **** -如果你正在使用SSDs,确保你的系统 I/O 调度程序是((("I/O scheduler"))) 配置正确的。 +如果你正在使用 SSDs ,确保你的系统 I/O 调度程序是 ((("I/O scheduler"))) 配置正确的。 当你向硬盘写数据,I/O 调度程序决定何时把数据 _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq` (完全公平队列). 调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转媒介优化的: 旋转盘片的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 -这对SSD来说是低效的,然而,尽管这里没有涉及到旋转盘片。但是,`deadline` 或者 `noop` 应该被使用。`deadline`调度程序基于写入等待时间进行优化, -`noop`只是一个简单的FIFO队列。 +这对 SSD 来说是低效的,然而,尽管这里没有涉及到旋转盘片。但是,`deadline` 或者 `noop` 应该被使用。`deadline`调度程序基于写入等待时间进行优化, +`noop`只是一个简单的 FIFO 队列。 这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 **** 如果你使用旋转媒介,尝试获取尽可能快的硬盘 (高性能服务器硬盘, 15k RPM 驱动器). 
-使用RAID 0是提高硬盘速度的有效途径, 对旋转硬盘和SSD来说都是如此。没有必要使用镜像或其它RAID变体,因为高可用已经通过replicas内建于Elasticsearch之中。 +使用 RAID 0 是提高硬盘速度的有效途径,对旋转硬盘和 SSD 来说都是如此。没有必要使用镜像或其它 RAID 变体,因为高可用已经通过 replicas 内建于 Elasticsearch 之中。 -最后,避免使用网络附加存储 (NAS)。人们常声称他们的NAS解决方案比本地驱动器更快更可靠。除却这些声称, -我们从没看到NAS能配得上它的大肆宣传。NAS常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。 +最后,避免使用网络附加存储 (NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称, +我们从没看到NAS能配得上它的大肆宣传。 NAS 常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。 ==== 网络 -快速可靠的网络显然对分布式系统的性能是很重要的((("hardware", "network")))((("network")))。 -低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络 -(1 GbE, 10 GbE) 对绝大多数集群都是足够的。 +快速可靠的网络显然对分布式系统的性能是很重要的 ((("hardware", "network")))((("network"))) 。 +低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络 (1 GbE, 10 GbE) 对绝大多数集群都是足够的。 即使数据中心们近在咫尺,也要避免集群跨越多个数据中心。绝对要避免集群跨越大的地理距离。 -Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节点在150ms外的另一数据中心而有所不同。更大的延时会加重分布式系统中的问题而且使得调试和排错更困难。 +Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节点在150ms 外的另一数据中心而有所不同。更大的延时会加重分布式系统中的问题而且使得调试和排错更困难。 -和NAS的争论类似,每个人都声称他们的数据中心间的线路都是健壮和低延时的。这是真的--直到它不是时(网络失败终究是会发生的,你可以相信它)。 +和 NAS 的争论类似,每个人都声称他们的数据中心间的线路都是健壮和低延时的。这是真的--直到它不是时(网络失败终究是会发生的,你可以相信它)。 从我们的经验来看,处理跨数据中心集群的麻烦事是根本不值得的。 ==== 一般注意事项 -获取真正的巨型机器在今天是可能的:((("hardware", "general considerations"))) 成百GB的RAM 和 几十个 CPU 核心。 +获取真正的巨型机器在今天是可能的:((("hardware", "general considerations"))) 成百 GB 的 RAM 和 几十个 CPU 核心。 反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪条道路是最好的? 通常,选择中到大型机器更好。避免使用小型机器, -因为你不会希望去管理拥有上千个节点的集群,而且在这些小型机器上 运行Elasticsearch的开销也是显著的。 +因为你不会希望去管理拥有上千个节点的集群,而且在这些小型机器上 运行 Elasticsearch 的开销也是显著的。 -与此同时,避免使用真正的巨型机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但CPU没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 +与此同时,避免使用真正的巨型机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但 CPU 没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 From 61ecfa1e119cddc67d766846a16394c8d6a9b8aa Mon Sep 17 00:00:00 2001 From: "pengqiuyuanfj@gmail.com" Date: Thu, 3 Mar 2016 00:37:43 +0800 Subject: [PATCH 07/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 小括号前后去掉空格 --- 510_Deployment/20_hardware.asciidoc | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index 312273670..ff8804176 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -1,35 +1,33 @@ [[hardware]] === 硬件 -按照正常的流程,你可能已经 ((("deployment", "hardware")))((("hardware"))) 在自己的笔记本电脑或集群上使用了 Elasticsearch 。 +按照正常的流程,你可能已经((("deployment", "hardware")))((("hardware")))在自己的笔记本电脑或集群上使用了 Elasticsearch 。 但是当要部署 Elasticsearch 到生产环境时,有一些建议是你需要考虑的。这里没有什么必须要遵守的准则,Elasticsearch 被用于在众多的机器上处理各种任务。基于我们在生产环境使用 Elasticsearch 集群的经验,可以为你提供一个好的起点。 ==== 内存 -如果有一种资源是最先被耗尽的,它可能是内存。 ((("hardware", "memory")))((("memory"))) -排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。 ((("heap"))) 即使堆空间是比较小的时候, +如果有一种资源是最先被耗尽的,它可能是内存。((("hardware", "memory")))((("memory")))排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。((("heap")))即使堆空间是比较小的时候, 也能为操作系统文件缓存提供额外的内存。因为 Lucene 使用的许多数据结构是基于磁盘的格式, Elasticsearch 利用操作系统缓存能产生很大效果。 -64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反 (你最终需要很多很多的小机器), 大于64 GB的机器也会有问题, +64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反(你最终需要很多很多的小机器), 大于64 GB的机器也会有问题, 我们将在 <> 中讨论。 ==== CPUs -大多数 Elasticsearch 部署往往对CPU要求很小。因此, ((("CPUs (central processing units)")))((("hardware", "CPUs"))) -确切的处理器安装事项少于其他资源。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 +大多数 Elasticsearch 部署往往对CPU要求很小。因此,((("CPUs (central processing units)")))((("hardware", "CPUs")))确切的处理器安装事项少于其他资源。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 如果你要在更快的CPUs和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发远胜过稍微快一点点的时钟频率。 ==== 硬盘 
-硬盘对所有的集群都很重要, ((("disks")))((("hardware", "disks"))) 对高度索引的集群更是加倍重要 +硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对高度索引的集群更是加倍重要 (例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 如果你负担得起 SSD ,它将远远超出任何旋转媒介(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起, SSD 是一个好的选择。 .检查你的 I/O 调度程序 **** -如果你正在使用 SSDs ,确保你的系统 I/O 调度程序是 ((("I/O scheduler"))) 配置正确的。 +如果你正在使用 SSDs ,确保你的系统 I/O 调度程序是((("I/O scheduler")))配置正确的。 当你向硬盘写数据,I/O 调度程序决定何时把数据 _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq` (完全公平队列). @@ -42,7 +40,7 @@ _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都 这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 **** -如果你使用旋转媒介,尝试获取尽可能快的硬盘 (高性能服务器硬盘, 15k RPM 驱动器). +如果你使用旋转媒介,尝试获取尽可能快的硬盘(高性能服务器硬盘, 15k RPM 驱动器). 使用 RAID 0 是提高硬盘速度的有效途径,对旋转硬盘和 SSD 来说都是如此。没有必要使用镜像或其它 RAID 变体,因为高可用已经通过 replicas 内建于 Elasticsearch 之中。 @@ -51,7 +49,7 @@ _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都 ==== 网络 -快速可靠的网络显然对分布式系统的性能是很重要的 ((("hardware", "network")))((("network"))) 。 +快速可靠的网络显然对分布式系统的性能是很重要的((("hardware", "network")))((("network")))。 低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络 (1 GbE, 10 GbE) 对绝大多数集群都是足够的。 即使数据中心们近在咫尺,也要避免集群跨越多个数据中心。绝对要避免集群跨越大的地理距离。 @@ -63,7 +61,7 @@ Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节 ==== 一般注意事项 -获取真正的巨型机器在今天是可能的:((("hardware", "general considerations"))) 成百 GB 的 RAM 和 几十个 CPU 核心。 +获取真正的巨型机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和 几十个 CPU 核心。 反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪条道路是最好的? 通常,选择中到大型机器更好。避免使用小型机器, From 0ae975188152ecb5df57e80c491c8f95f2c6f0f3 Mon Sep 17 00:00:00 2001 From: "pengqiuyuanfj@gmail.com" Date: Thu, 3 Mar 2016 00:45:00 +0800 Subject: [PATCH 08/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 添加一个主语,“可以为你提供一个好的起点”改为“这些建议可以为你提供一个好的起点” --- 510_Deployment/20_hardware.asciidoc | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index ff8804176..a981415f0 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -2,7 +2,7 @@ === 硬件 按照正常的流程,你可能已经((("deployment", "hardware")))((("hardware")))在自己的笔记本电脑或集群上使用了 Elasticsearch 。 -但是当要部署 Elasticsearch 到生产环境时,有一些建议是你需要考虑的。这里没有什么必须要遵守的准则,Elasticsearch 被用于在众多的机器上处理各种任务。基于我们在生产环境使用 Elasticsearch 集群的经验,可以为你提供一个好的起点。 +但是当要部署 Elasticsearch 到生产环境时,有一些建议是你需要考虑的。这里没有什么必须要遵守的准则,Elasticsearch 被用于在众多的机器上处理各种任务。基于我们在生产环境使用 Elasticsearch 集群的经验,这些建议可以为你提供一个好的起点。 ==== 内存 @@ -20,8 +20,7 @@ ==== 硬盘 -硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对高度索引的集群更是加倍重要 -(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 +硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对高度索引的集群更是加倍重要(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 如果你负担得起 SSD ,它将远远超出任何旋转媒介(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起, SSD 是一个好的选择。 From fcbf756037c5b789f0884852e4ca455e8737c6d5 Mon Sep 17 00:00:00 2001 From: "pengqiuyuanfj@gmail.com" Date: Thu, 3 Mar 2016 01:03:29 +0800 Subject: [PATCH 09/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 移除一个空格 --- 510_Deployment/20_hardware.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index a981415f0..d1e258121 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -9,7 +9,7 @@ 如果有一种资源是最先被耗尽的,它可能是内存。((("hardware", 
"memory")))((("memory")))排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。((("heap")))即使堆空间是比较小的时候, 也能为操作系统文件缓存提供额外的内存。因为 Lucene 使用的许多数据结构是基于磁盘的格式, Elasticsearch 利用操作系统缓存能产生很大效果。 -64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反(你最终需要很多很多的小机器), 大于64 GB的机器也会有问题, +64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反(你最终需要很多很多的小机器),大于64 GB的机器也会有问题, 我们将在 <> 中讨论。 ==== CPUs From 8690e60e8eb7b9dec4575ff27795f19ebaacda9e Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 3 Mar 2016 09:55:39 +0800 Subject: [PATCH 10/95] 510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 修改英文符号 --- 510_Deployment/20_hardware.asciidoc | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index d1e258121..cf1b89d68 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -9,7 +9,7 @@ 如果有一种资源是最先被耗尽的,它可能是内存。((("hardware", "memory")))((("memory")))排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。((("heap")))即使堆空间是比较小的时候, 也能为操作系统文件缓存提供额外的内存。因为 Lucene 使用的许多数据结构是基于磁盘的格式, Elasticsearch 利用操作系统缓存能产生很大效果。 -64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反(你最终需要很多很多的小机器),大于64 GB的机器也会有问题, +64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反(你最终需要很多很多的小机器),大于64 GB 的机器也会有问题, 我们将在 <> 中讨论。 ==== CPUs @@ -28,7 +28,7 @@ **** 如果你正在使用 SSDs ,确保你的系统 I/O 调度程序是((("I/O scheduler")))配置正确的。 当你向硬盘写数据,I/O 调度程序决定何时把数据 -_实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq` (完全公平队列). +_实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq` (完全公平队列)。 调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转媒介优化的: 旋转盘片的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 @@ -39,17 +39,17 @@ _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都 这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 **** -如果你使用旋转媒介,尝试获取尽可能快的硬盘(高性能服务器硬盘, 15k RPM 驱动器). 
+如果你使用旋转媒介,尝试获取尽可能快的硬盘(高性能服务器硬盘,15k RPM 驱动器)。 使用 RAID 0 是提高硬盘速度的有效途径,对旋转硬盘和 SSD 来说都是如此。没有必要使用镜像或其它 RAID 变体,因为高可用已经通过 replicas 内建于 Elasticsearch 之中。 -最后,避免使用网络附加存储 (NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称, +最后,避免使用网络附加存储(NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称, 我们从没看到NAS能配得上它的大肆宣传。 NAS 常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。 ==== 网络 快速可靠的网络显然对分布式系统的性能是很重要的((("hardware", "network")))((("network")))。 -低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络 (1 GbE, 10 GbE) 对绝大多数集群都是足够的。 +低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络(1 GbE, 10 GbE)对绝大多数集群都是足够的。 即使数据中心们近在咫尺,也要避免集群跨越多个数据中心。绝对要避免集群跨越大的地理距离。 From 2c27bfb94ef7baf3976af89d704a64a9de706913 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 3 Mar 2016 10:44:58 +0800 Subject: [PATCH 11/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit “小型、中型、大型机器”修改为“高配,中配,低配机器” --- 510_Deployment/20_hardware.asciidoc | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index cf1b89d68..f468db95e 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -1,7 +1,7 @@ [[hardware]] === 硬件 -按照正常的流程,你可能已经((("deployment", "hardware")))((("hardware")))在自己的笔记本电脑或集群上使用了 Elasticsearch 。 +按照正常的流程,你可能已经((("deployment", "hardware")))((("hardware")))在自己的笔记本电脑或集群上使用了 Elasticsearch。 但是当要部署 Elasticsearch 到生产环境时,有一些建议是你需要考虑的。这里没有什么必须要遵守的准则,Elasticsearch 被用于在众多的机器上处理各种任务。基于我们在生产环境使用 Elasticsearch 集群的经验,这些建议可以为你提供一个好的起点。 ==== 内存 @@ -14,7 +14,7 @@ ==== CPUs -大多数 Elasticsearch 部署往往对CPU要求很小。因此,((("CPUs (central processing units)")))((("hardware", "CPUs")))确切的处理器安装事项少于其他资源。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 +大多数 Elasticsearch 部署往往对 CPU 要求很小。因此,((("CPUs (central processing units)")))((("hardware", "CPUs")))确切的处理器安装事项少于其他资源。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 如果你要在更快的CPUs和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发远胜过稍微快一点点的时钟频率。 @@ -22,19 +22,19 @@ 硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对高度索引的集群更是加倍重要(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 -如果你负担得起 SSD ,它将远远超出任何旋转媒介(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起, SSD 是一个好的选择。 +如果你负担得起 SSD,它将远远超出任何旋转媒介(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起, SSD 是一个好的选择。 .检查你的 I/O 调度程序 **** -如果你正在使用 SSDs ,确保你的系统 I/O 调度程序是((("I/O scheduler")))配置正确的。 +如果你正在使用 SSDs,确保你的系统 I/O 调度程序是((("I/O scheduler")))配置正确的。 当你向硬盘写数据,I/O 调度程序决定何时把数据 _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq` (完全公平队列)。 调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转媒介优化的: 旋转盘片的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 -这对 SSD 来说是低效的,然而,尽管这里没有涉及到旋转盘片。但是,`deadline` 或者 `noop` 应该被使用。`deadline`调度程序基于写入等待时间进行优化, -`noop`只是一个简单的 FIFO 队列。 +这对 SSD 来说是低效的,然而,尽管这里没有涉及到旋转盘片。但是,`deadline` 或者 `noop` 应该被使用。`deadline` 调度程序基于写入等待时间进行优化, +`noop` 只是一个简单的 FIFO 队列。 这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 **** @@ -43,7 +43,7 @@ _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都 使用 RAID 0 是提高硬盘速度的有效途径,对旋转硬盘和 SSD 来说都是如此。没有必要使用镜像或其它 RAID 变体,因为高可用已经通过 replicas 内建于 Elasticsearch 之中。 -最后,避免使用网络附加存储(NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称, +最后,避免使用网络附加存储 (NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称, 我们从没看到NAS能配得上它的大肆宣传。 NAS 常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。 ==== 网络 @@ -60,10 +60,10 @@ Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节 ==== 一般注意事项 -获取真正的巨型机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和 几十个 CPU 核心。 +获取真正的高配机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和 几十个 CPU 核心。 反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪条道路是最好的? 
-通常,选择中到大型机器更好。避免使用小型机器, -因为你不会希望去管理拥有上千个节点的集群,而且在这些小型机器上 运行 Elasticsearch 的开销也是显著的。 +通常,选择中配或者高配机器更好。避免使用低配机器, +因为你不会希望去管理拥有上千个节点的集群,而且在这些低配机器上 运行 Elasticsearch 的开销也是显著的。 -与此同时,避免使用真正的巨型机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但 CPU 没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 +与此同时,避免使用真正的高配机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但 CPU 没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 From e20696257a2b193c0c4807b82bc2d8d4eb101ad1 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 3 Mar 2016 10:57:41 +0800 Subject: [PATCH 12/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 空格 --- 510_Deployment/20_hardware.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index f468db95e..362459f55 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -16,7 +16,7 @@ 大多数 Elasticsearch 部署往往对 CPU 要求很小。因此,((("CPUs (central processing units)")))((("hardware", "CPUs")))确切的处理器安装事项少于其他资源。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 -如果你要在更快的CPUs和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发远胜过稍微快一点点的时钟频率。 +如果你要在更快的 CPUs 和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发远胜过稍微快一点点的时钟频率。 ==== 硬盘 From d20ed672ee2ea478ff84522d595955d348e05cb1 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 3 Mar 2016 11:35:35 +0800 Subject: [PATCH 13/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1、“旋转硬盘”修改为“机械硬盘” 2、“旋转媒介”修改为“旋转介质”,3个地方 3、“哪条道路”修改为“哪种方式” 4、“对 CPU 要求很小”修改为“对 CPU 要求不高” 5、“确切的处理器安装事项少于其他资源”修改为“相对其它资源,具体配置多少个(CPU)不是那么关键” 6、“高度索引”修改为“大量写入” --- 510_Deployment/20_hardware.asciidoc | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index 362459f55..e658a6ed3 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -14,15 +14,15 @@ ==== CPUs -大多数 Elasticsearch 部署往往对 CPU 要求很小。因此,((("CPUs (central processing units)")))((("hardware", "CPUs")))确切的处理器安装事项少于其他资源。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 +大多数 Elasticsearch 部署往往对 CPU 要求不高。因此,((("CPUs (central processing units)")))((("hardware", "CPUs")))相对其它资源,具体配置多少个(CPU)不是那么关键。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 如果你要在更快的 CPUs 和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发远胜过稍微快一点点的时钟频率。 ==== 硬盘 -硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对高度索引的集群更是加倍重要(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 +硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对大量写入的集群更是加倍重要(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 -如果你负担得起 SSD,它将远远超出任何旋转媒介(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起, SSD 是一个好的选择。 +如果你负担得起 SSD,它将远远超出任何旋转介质(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起, SSD 是一个好的选择。 .检查你的 I/O 调度程序 **** @@ -30,7 +30,7 @@ 当你向硬盘写数据,I/O 调度程序决定何时把数据 _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq` (完全公平队列)。 -调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转媒介优化的: +调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转介质优化的: 旋转盘片的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 这对 SSD 来说是低效的,然而,尽管这里没有涉及到旋转盘片。但是,`deadline` 或者 `noop` 应该被使用。`deadline` 调度程序基于写入等待时间进行优化, @@ -39,9 +39,9 @@ _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都 这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 **** -如果你使用旋转媒介,尝试获取尽可能快的硬盘(高性能服务器硬盘,15k RPM 驱动器)。 +如果你使用旋转介质,尝试获取尽可能快的硬盘(高性能服务器硬盘,15k RPM 驱动器)。 -使用 RAID 0 是提高硬盘速度的有效途径,对旋转硬盘和 SSD 来说都是如此。没有必要使用镜像或其它 RAID 变体,因为高可用已经通过 replicas 内建于 Elasticsearch 之中。 +使用 RAID 0 是提高硬盘速度的有效途径,对机械硬盘和 SSD 来说都是如此。没有必要使用镜像或其它 RAID 变体,因为高可用已经通过 replicas 内建于 
Elasticsearch 之中。 最后,避免使用网络附加存储 (NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称, 我们从没看到NAS能配得上它的大肆宣传。 NAS 常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。 @@ -61,7 +61,7 @@ Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节 ==== 一般注意事项 获取真正的高配机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和 几十个 CPU 核心。 -反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪条道路是最好的? +反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪种方式是最好的? 通常,选择中配或者高配机器更好。避免使用低配机器, 因为你不会希望去管理拥有上千个节点的集群,而且在这些低配机器上 运行 Elasticsearch 的开销也是显著的。 From ff6ba387b6aca36a076ca83d722108c38a170d2f Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 3 Mar 2016 14:48:22 +0800 Subject: [PATCH 14/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1、标点符号旁边的英文 移除和中文之间的空格。如:(,Elasticsearch 利用) 2、“但 CPU 没有”修改为“但 CPU 却没有” 3、英文括号换成全角的中文括号 --- 510_Deployment/20_hardware.asciidoc | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index e658a6ed3..2b45be478 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -7,9 +7,9 @@ ==== 内存 如果有一种资源是最先被耗尽的,它可能是内存。((("hardware", "memory")))((("memory")))排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。((("heap")))即使堆空间是比较小的时候, -也能为操作系统文件缓存提供额外的内存。因为 Lucene 使用的许多数据结构是基于磁盘的格式, Elasticsearch 利用操作系统缓存能产生很大效果。 +也能为操作系统文件缓存提供额外的内存。因为 Lucene 使用的许多数据结构是基于磁盘的格式,Elasticsearch 利用操作系统缓存能产生很大效果。 -64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反(你最终需要很多很多的小机器),大于64 GB 的机器也会有问题, +64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反(你最终需要很多很多的小机器),大于64 GB 的机器也会有问题, 我们将在 <> 中讨论。 ==== CPUs @@ -20,15 +20,15 @@ ==== 硬盘 -硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对大量写入的集群更是加倍重要(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 +硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对大量写入的集群更是加倍重要(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 -如果你负担得起 SSD,它将远远超出任何旋转介质(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起, SSD 是一个好的选择。 +如果你负担得起 SSD,它将远远超出任何旋转介质(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起,SSD 是一个好的选择。 .检查你的 I/O 调度程序 **** 如果你正在使用 SSDs,确保你的系统 I/O 调度程序是((("I/O scheduler")))配置正确的。 当你向硬盘写数据,I/O 调度程序决定何时把数据 -_实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq` (完全公平队列)。 +_实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq`(完全公平队列)。 调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转介质优化的: 旋转盘片的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 @@ -39,23 +39,23 @@ _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都 这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 **** -如果你使用旋转介质,尝试获取尽可能快的硬盘(高性能服务器硬盘,15k RPM 驱动器)。 +如果你使用旋转介质,尝试获取尽可能快的硬盘(高性能服务器硬盘,15k RPM 驱动器)。 使用 RAID 0 是提高硬盘速度的有效途径,对机械硬盘和 SSD 来说都是如此。没有必要使用镜像或其它 RAID 变体,因为高可用已经通过 replicas 内建于 Elasticsearch 之中。 -最后,避免使用网络附加存储 (NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称, -我们从没看到NAS能配得上它的大肆宣传。 NAS 常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。 +最后,避免使用网络附加存储(NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称, +我们从没看到 NAS 能配得上它的大肆宣传。NAS 常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。 ==== 网络 快速可靠的网络显然对分布式系统的性能是很重要的((("hardware", "network")))((("network")))。 -低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络(1 GbE, 10 GbE)对绝大多数集群都是足够的。 +低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络(1 GbE, 10 GbE)对绝大多数集群都是足够的。 即使数据中心们近在咫尺,也要避免集群跨越多个数据中心。绝对要避免集群跨越大的地理距离。 Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节点在150ms 外的另一数据中心而有所不同。更大的延时会加重分布式系统中的问题而且使得调试和排错更困难。 -和 NAS 的争论类似,每个人都声称他们的数据中心间的线路都是健壮和低延时的。这是真的--直到它不是时(网络失败终究是会发生的,你可以相信它)。 +和 NAS 的争论类似,每个人都声称他们的数据中心间的线路都是健壮和低延时的。这是真的--直到它不是时(网络失败终究是会发生的,你可以相信它)。 从我们的经验来看,处理跨数据中心集群的麻烦事是根本不值得的。 ==== 一般注意事项 @@ -66,4 +66,4 @@ Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节 
通常,选择中配或者高配机器更好。避免使用低配机器, 因为你不会希望去管理拥有上千个节点的集群,而且在这些低配机器上 运行 Elasticsearch 的开销也是显著的。 -与此同时,避免使用真正的高配机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但 CPU 没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 +与此同时,避免使用真正的高配机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但 CPU 却没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 From facb70fec9cd8842c651eb6cdcb9b9786f64e38a Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 3 Mar 2016 18:40:06 +0800 Subject: [PATCH 15/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit “旋转盘片” 修改为 “机械硬盘” --- 510_Deployment/20_hardware.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index 2b45be478..6308ade26 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -31,9 +31,9 @@ _实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq`(完全公平队列)。 调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转介质优化的: -旋转盘片的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 +机械硬盘的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 -这对 SSD 来说是低效的,然而,尽管这里没有涉及到旋转盘片。但是,`deadline` 或者 `noop` 应该被使用。`deadline` 调度程序基于写入等待时间进行优化, +这对 SSD 来说是低效的,然而,尽管这里没有涉及到机械硬盘。但是,`deadline` 或者 `noop` 应该被使用。`deadline` 调度程序基于写入等待时间进行优化, `noop` 只是一个简单的 FIFO 队列。 这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 From bb8f539f1007f416dc08d5308cb5b536e32bd02c Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 3 Mar 2016 19:20:06 +0800 Subject: [PATCH 16/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 移除多余的空格 --- 510_Deployment/20_hardware.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index 6308ade26..544911ace 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -60,10 +60,10 @@ Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节 ==== 一般注意事项 -获取真正的高配机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和 几十个 CPU 核心。 +获取真正的高配机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和几十个 CPU 核心。 反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪种方式是最好的? 
通常,选择中配或者高配机器更好。避免使用低配机器, -因为你不会希望去管理拥有上千个节点的集群,而且在这些低配机器上 运行 Elasticsearch 的开销也是显著的。 +因为你不会希望去管理拥有上千个节点的集群,而且在这些低配机器上运行 Elasticsearch 的开销也是显著的。 与此同时,避免使用真正的高配机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但 CPU 却没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 From bae79c30afdb1e951428515c19e1c0a62f3b1fca Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Fri, 4 Mar 2016 11:52:07 +0800 Subject: [PATCH 17/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1、修复三括号导致的中文间有空隙的问题 2、“_实际_”修改为“实际” 3、移除一个语气词“然而” 4、“一般注意事项”修改为“注意事项” --- 510_Deployment/20_hardware.asciidoc | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index 544911ace..95ddc72d2 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -1,7 +1,7 @@ [[hardware]] === 硬件 -按照正常的流程,你可能已经((("deployment", "hardware")))((("hardware")))在自己的笔记本电脑或集群上使用了 Elasticsearch。 +按照正常的流程,((("deployment", "hardware")))((("hardware")))你可能已经在自己的笔记本电脑或集群上使用了 Elasticsearch。 但是当要部署 Elasticsearch 到生产环境时,有一些建议是你需要考虑的。这里没有什么必须要遵守的准则,Elasticsearch 被用于在众多的机器上处理各种任务。基于我们在生产环境使用 Elasticsearch 集群的经验,这些建议可以为你提供一个好的起点。 ==== 内存 @@ -27,13 +27,13 @@ .检查你的 I/O 调度程序 **** 如果你正在使用 SSDs,确保你的系统 I/O 调度程序是((("I/O scheduler")))配置正确的。 -当你向硬盘写数据,I/O 调度程序决定何时把数据 -_实际_ 发送到硬盘。大多数默认 *nix 发行版下的调度程序都叫做 `cfq`(完全公平队列)。 +当你向硬盘写数据,I/O 调度程序决定何时把数据实际发送到硬盘。 +大多数默认 *nix 发行版下的调度程序都叫做 `cfq`(完全公平队列)。 调度程序分配 _时间片_ 到每个进程。并且优化这些到硬盘的众多队列的传递。但它是为旋转介质优化的: 机械硬盘的固有特性意味着它写入数据到基于物理布局的硬盘会更高效。 -这对 SSD 来说是低效的,然而,尽管这里没有涉及到机械硬盘。但是,`deadline` 或者 `noop` 应该被使用。`deadline` 调度程序基于写入等待时间进行优化, +这对 SSD 来说是低效的,尽管这里没有涉及到机械硬盘。但是,`deadline` 或者 `noop` 应该被使用。`deadline` 调度程序基于写入等待时间进行优化, `noop` 只是一个简单的 FIFO 队列。 这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们看到了500倍的写入能力提升。 @@ -58,7 +58,7 @@ Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节 和 NAS 的争论类似,每个人都声称他们的数据中心间的线路都是健壮和低延时的。这是真的--直到它不是时(网络失败终究是会发生的,你可以相信它)。 从我们的经验来看,处理跨数据中心集群的麻烦事是根本不值得的。 -==== 一般注意事项 +==== 注意事项 获取真正的高配机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和几十个 CPU 核心。 反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪种方式是最好的? From c3bb310e05895eb80be6c77868bee8a62099323f Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Fri, 4 Mar 2016 11:56:44 +0800 Subject: [PATCH 18/95] chapter46_part2: /510_Deployment/20_hardware.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1、“general considerations” 翻译为“总则” --- 510_Deployment/20_hardware.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index 95ddc72d2..d7feccb06 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -58,7 +58,7 @@ Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节 和 NAS 的争论类似,每个人都声称他们的数据中心间的线路都是健壮和低延时的。这是真的--直到它不是时(网络失败终究是会发生的,你可以相信它)。 从我们的经验来看,处理跨数据中心集群的麻烦事是根本不值得的。 -==== 注意事项 +==== 总则 获取真正的高配机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和几十个 CPU 核心。 反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪种方式是最好的? 
From b62ab4aeea0cfe792ceb942417b0b5a61030a01b Mon Sep 17 00:00:00 2001 From: Yuhao Bi Date: Sun, 6 Mar 2016 01:14:54 +0800 Subject: [PATCH 19/95] chapter44_part1: 410_Scaling/10_Intro.asciidoc --- 410_Scaling/10_Intro.asciidoc | 34 +++++++++++----------------------- 1 file changed, 11 insertions(+), 23 deletions(-) diff --git a/410_Scaling/10_Intro.asciidoc b/410_Scaling/10_Intro.asciidoc index fc4d8ec0d..8cd3e87c8 100644 --- a/410_Scaling/10_Intro.asciidoc +++ b/410_Scaling/10_Intro.asciidoc @@ -1,29 +1,17 @@ [[scale]] -== Designing for Scale +== 扩容设计 -Elasticsearch is used by some companies to index ((("scaling", "designing for scale")))and search petabytes of data -every day, but most of us start out with something a little more humble in -size. Even if we aspire to be the next Facebook, it is unlikely that our bank -balance matches our aspirations. We need to build for what we have today, but -in a way that will allow us to scale out flexibly and rapidly. +一些公司每天使用 Elasticsearch((("scaling", "designing for scale"))) 索引检索 PB 级数据, +但我们中的大多数都起步于规模稍逊的项目。即使我们立志成为下一个 Facebook,我们的银行卡余额却也跟不上梦想的脚步。 +我们需要为今日所需而构建,但也要允许我们可以灵活而又快速地进行水平扩展。 -Elasticsearch is built to scale. It will run very happily on your laptop or -in a cluster containing hundreds of nodes, and the experience is almost -identical. Growing from a small cluster to a large cluster is almost entirely -automatic and painless. Growing from a large cluster to a very large cluster -requires a bit more planning and design, but it is still relatively painless. +Elasticsearch 为了可扩展性而生。它可以良好地运行于你的笔记本电脑又或者一个拥有数百节点的集群,同时用户体验基本相同。 +由小规模集群增长为大规模集群的过程几乎完全自动化并且无痛。由大规模集群增长为超大规模集群需要一些规划和设计,但还是相对地无痛。 -Of course, it is not magic. Elasticsearch has its limitations too. If you -are aware of those limitations and work with them, the growing process will be -pleasant. If you treat Elasticsearch badly, you could be in for a world of -pain. +当然这一切并不是魔法。Elasticsearch 也有它的局限性。如果你了解这些局限性并能够与之相处,集群扩容的过程将会是愉快的。 +如果你对 Elasticsearch 处理不当,那么你将处于一个充满痛苦的世界。 -The default settings in Elasticsearch will take you a long way, but to get the -most bang for your buck, you need to think about how data flows through your -system. We will talk about two common data flows: time-based data (such as log -events or social network streams, where relevance is driven by recency), and -user-based data (where a large document collection can be subdivided by user or -customer). +Elasticsearch 的默认设置会伴你走过很长的一段路,但为了发挥它最大的效用,你需要考虑数据是如何流经你的系统的。 +我们将讨论两种常见的数据流:时序数据(时间驱动相关性,例如日志或社交网络数据流),以及基于用户的数据(拥有很大的文档集但可以按用户或客户细分)。 -This chapter will help you make the right decisions up front, to avoid -nasty surprises later. +这一章将帮助你在遇到不愉快之前做出正确的选择。 From 3e31f3ed1ef596586c0eada5688576595da9684d Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 3 Mar 2016 18:25:11 +0800 Subject: [PATCH 20/95] chapter5_part1: /50_Search/00_Intro.asciidoc Closes #9 --- 050_Search/00_Intro.asciidoc | 63 ++++------ 510_Deployment/50_heap.asciidoc | 209 +++++++++++++------------------- 2 files changed, 109 insertions(+), 163 deletions(-) diff --git a/050_Search/00_Intro.asciidoc b/050_Search/00_Intro.asciidoc index 6b6516a6c..1c11ebd53 100644 --- a/050_Search/00_Intro.asciidoc +++ b/050_Search/00_Intro.asciidoc @@ -1,60 +1,43 @@ [[search]] -== Searching--The Basic Tools +== 搜索——最基本的工具 -So far, we have learned how to use Elasticsearch as a simple NoSQL-style -distributed document store. We can ((("searching")))throw JSON documents at Elasticsearch and -retrieve each one by ID. 
But the real power of Elasticsearch lies in its -ability to make sense out of chaos -- to turn Big Data into Big Information. +现在,我们已经学会了如何使用 Elasticsearch 作为一个简单的 NoSQL 风格的分布式文档存储系统。我们可以((("searching")))将一个 JSON 文档扔到 Elasticsearch 里,然后根据 ID 检索。但 Elasticsearch 真正强大之处在于可以从无规律的数据中找出有意义的信息——从“大数据”到“大信息”。 -This is the reason that we use structured JSON documents, rather than -amorphous blobs of data. Elasticsearch not only _stores_ the document, but -also _indexes_ the content of the document in order to make it searchable. +Elasticsearch 不只会_存储(stores)_ 文档,为了能被搜索到也会为文档添加_索引(indexes)_ ,这也是为什么我们使用结构化的 JSON 文档,而不是无结构的二进制数据。 -_Every field in a document is indexed and can be queried_. ((("indexing"))) And it's not just -that. During a single query, Elasticsearch can use _all_ of these indices, to -return results at breath-taking speed. That's something that you could never -consider doing with a traditional database. +_文档中的每个字段都将被索引并且可以被查询_ 。((("indexing")))不仅如此,在简单查询时,Elasticsearch 可以使用 _所有(all)_ 这些索引字段,以惊人的速度返回结果。这是你永远不会考虑用传统数据库去做的一些事情。 -A _search_ can be any of the following: +_搜索(search)_ 可以做到: -* A structured query on concrete fields((("fields", "searching on")))((("searching", "types of searches"))) like `gender` or `age`, sorted by - a field like `join_date`, similar to the type of query that you could construct - in SQL +* 在类似于 `gender` 或者 `age` 这样的字段((("fields", "searching on")))((("searching", "types of searches")))上使用结构化查询,`join_date` 这样的字段上使用排序,就像SQL的结构化查询一样。 -* A full-text query, which finds all documents matching the search keywords, - and returns them sorted by _relevance_ +* 全文检索,找出所有匹配关键字的文档并按照_相关性(relevance)_ 排序后返回结果。 -* A combination of the two +* 以上二者兼而有之。 -While many searches will just work out of((("full text search"))) the box, to use Elasticsearch to -its full potential, you need to understand three subjects: +很多搜索都是开箱即用的((("full text search"))),为了充分挖掘 Elasticsearch 的潜力,你需要理解以下三个概念: - _Mapping_:: - How the data in each field is interpreted - - _Analysis_:: - How full text is processed to make it searchable - - _Query DSL_:: - The flexible, powerful query language used by Elasticsearch + _映射(Mapping)_ :: + 描述数据在每个字段内如何存储 -Each of these is a big subject in its own right, and we explain them in -detail in <>. The chapters in this section introduce the -basic concepts of all three--just enough to help you to get an overall -understanding of how search works. + _分析(Analysis)_ :: + 全文是如何处理使之可以被搜索的 -We will start by explaining the `search` API in its simplest form. + _领域特定查询语言(Query DSL)_ :: + Elasticsearch 中强大灵活的查询语言 -.Test Data +以上提到的每个点都是一个大话题,我们将在 <> 一章详细阐述它们。本章节我们将介绍这三点的一些基本概念——仅仅帮助你大致了解搜索是如何工作的。 + +我们将使用最简单的形式开始介绍 `search` API。 + +.测试数据 **** -The documents that we will use for test purposes in this chapter can be found -in this gist: https://gist.github.com/clintongormley/8579281. +本章节的测试数据可以在这里找到: https://gist.github.com/clintongormley/8579281 。 -You can copy the commands and paste them into your shell in order to follow -along with this chapter. +你可以把这些命令复制到终端中执行来实践本章的例子。 -Alternatively, if you're in the online version of this book, you can link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open in Sense]. 
+另外,如果你读的是在线版本,可以 link:sense_widget.html?snippets/050_Search/Test_data.json[点击这个链接] 感受下。 **** diff --git a/510_Deployment/50_heap.asciidoc b/510_Deployment/50_heap.asciidoc index 2d45a8830..8e8105bba 100644 --- a/510_Deployment/50_heap.asciidoc +++ b/510_Deployment/50_heap.asciidoc @@ -1,101 +1,76 @@ [[heap-sizing]] -=== Heap: Sizing and Swapping +=== 堆内存:大小和交换 -The default installation of Elasticsearch is configured with a 1 GB heap. ((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting"))) For -just about every deployment, this number is far too small. If you are using the -default heap values, your cluster is probably configured incorrectly. +Elasticsearch 默认安装后设置的堆内存是 1 GB。((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting")))对于任何一个业务部署来说, +这个设置都太小了。如果你正在使用这些默认堆内存配置,您的群集可能会出现问题。 -There are two ways to change the heap size in Elasticsearch. The easiest is to -set an environment variable called `ES_HEAP_SIZE`.((("ES_HEAP_SIZE environment variable"))) When the server process -starts, it will read this environment variable and set the heap accordingly. -As an example, you can set it via the command line as follows: +这里有两种方式修改 Elasticsearch 的堆内存。最简单的一个方法就是指定 `ES_HEAP_SIZE` 环境变量。((("ES_HEAP_SIZE environment variable")))服务进程在启动时候会读取这个变量,并相应的设置堆的大小。 +比如,你可以用下面的命令设置它: [source,bash] ---- export ES_HEAP_SIZE=10g ---- -Alternatively, you can pass in the heap size via a command-line argument when starting -the process, if that is easier for your setup: +此外,你也可以通过命令行参数的形式,在程序启动的时候把内存大小传递给它,如果你觉得这样更简单的话: [source,bash] ---- ./bin/elasticsearch -Xmx10g -Xms10g <1> ---- -<1> Ensure that the min (`Xms`) and max (`Xmx`) sizes are the same to prevent -the heap from resizing at runtime, a very costly process. +<1> 确保堆内存最小值( `Xms` )与最大值( `Xmx` )的大小是相同的,防止程序在运行时改变堆内存大小, +这是一个很耗系统资源的过程。 -Generally, setting the `ES_HEAP_SIZE` environment variable is preferred over setting -explicit `-Xmx` and `-Xms` values. +通常来说,设置 `ES_HEAP_SIZE` 环境变量,比直接写 `-Xmx -Xms` 更好一点。 -==== Give Half Your Memory to Lucene +==== 把你的内存的一半给 Lucene -A common problem is configuring a heap that is _too_ large. ((("heap", "sizing and setting", "giving half your memory to Lucene"))) You have a 64 GB -machine--and by golly, you want to give Elasticsearch all 64 GB of memory. More -is better! +一个常见的问题是给 Elasticsearch 分配的内存 _太_ 大了。((("heap", "sizing and setting", "giving half your memory to Lucene")))假设你有一个 64 GB 内存的机器, +天啊,我要把 64 GB 内存全都给 Elasticsearch。因为越多越好啊! -Heap is definitely important to Elasticsearch. It is used by many in-memory data -structures to provide fast operation. But with that said, there is another major -user of memory that is _off heap_: Lucene. +当然,内存对于 Elasticsearch 来说绝对是重要的,它可以被许多内存数据结构使用来提供更快的操作。但是说到这里, +还有另外一个内存消耗大户 _非堆内存_ (off-heap):Lucene。 -Lucene is designed to leverage the underlying OS for caching in-memory data structures.((("Lucene", "memory for"))) -Lucene segments are stored in individual files. Because segments are immutable, -these files never change. This makes them very cache friendly, and the underlying -OS will happily keep hot segments resident in memory for faster access. +Lucene 被设计为可以利用操作系统底层机制来缓存内存数据结构。((("Lucene", "memory for"))) +Lucene 的段是分别存储到单个文件中的。因为段是不可变的,这些文件也都不会变化,这是对缓存友好的,同时操作系统也会把这些段文件缓存起来,以便更快的访问。 -Lucene's performance relies on this interaction with the OS. But if you give all -available memory to Elasticsearch's heap, there won't be any left over for Lucene. -This can seriously impact the performance of full-text search. 
+Lucene 的性能取决于和操作系统的相互作用。如果你把所有的内存都分配给 Elasticsearch 的堆内存,那将不会有剩余的内存交给 Lucene。 +这将严重地影响全文检索的性能。 -The standard recommendation is to give 50% of the available memory to Elasticsearch -heap, while leaving the other 50% free. It won't go unused; Lucene will happily -gobble up whatever is left over. +标准的建议是把 50% 的可用内存作为 Elasticsearch 的堆内存,保留剩下的 50%。当然它也不会被浪费,Lucene 会很乐意利用起余下的内存。 [[compressed_oops]] -==== Don't Cross 32 GB! -There is another reason to not allocate enormous heaps to Elasticsearch. As it turns((("heap", "sizing and setting", "32gb heap boundary")))((("32gb Heap boundary"))) -out, the HotSpot JVM uses a trick to compress object pointers when heaps are less -than around 32 GB. - -In Java, all objects are allocated on the heap and referenced by a pointer. -Ordinary object pointers (OOP) point at these objects, and are traditionally -the size of the CPU's native _word_: either 32 bits or 64 bits, depending on the -processor. The pointer references the exact byte location of the value. - -For 32-bit systems, this means the maximum heap size is 4 GB. For 64-bit systems, -the heap size can get much larger, but the overhead of 64-bit pointers means there -is more wasted space simply because the pointer is larger. And worse than wasted -space, the larger pointers eat up more bandwidth when moving values between -main memory and various caches (LLC, L1, and so forth). - -Java uses a trick called https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[compressed oops]((("compressed object pointers"))) -to get around this problem. Instead of pointing at exact byte locations in -memory, the pointers reference _object offsets_.((("object offsets"))) This means a 32-bit pointer can -reference four billion _objects_, rather than four billion bytes. Ultimately, this -means the heap can grow to around 32 GB of physical size while still using a 32-bit -pointer. - -Once you cross that magical ~32 GB boundary, the pointers switch back to -ordinary object pointers. The size of each pointer grows, more CPU-memory -bandwidth is used, and you effectively lose memory. In fact, it takes until around -40–50 GB of allocated heap before you have the same _effective_ memory of a -heap just under 32 GB using compressed oops. - -The moral of the story is this: even when you have memory to spare, try to avoid -crossing the 32 GB heap boundary. It wastes memory, reduces CPU performance, and -makes the GC struggle with large heaps. - -==== Just how far under 32gb should I set the JVM? - -Unfortunately, that depends. The exact cutoff varies by JVMs and platforms. -If you want to play it safe, setting the heap to `31gb` is likely safe. -Alternatively, you can verify the cutoff point for the HotSpot JVM by adding -`-XX:+PrintFlagsFinal` to your JVM options and checking that the value of the -UseCompressedOops flag is true. This will let you find the exact cutoff for your -platform and JVM. - -For example, here we test a Java 1.7 installation on MacOSX and see the max heap -size is around 32600mb (~31.83gb) before compressed pointers are disabled: +==== 不要超过 32 GB! 
+这里有另外一个原因不分配大内存给 Elasticsearch。事实上((("heap", "sizing and setting", "32gb heap boundary")))((("32gb Heap boundary"))), +JVM 在内存小于 32 GB 的时候会采用一个内存对象指针压缩技术。 + +在 Java 中,所有的对象都分配在堆上,并通过一个指针进行引用。 +普通对象指针(OOP)指向这些对象,通常为 CPU _字长_ 的大小:32 位或 64 位,取决于你的处理器。 + +对于 32 位的系统,意味着堆内存大小最大为 4 GB。对于 64 位的系统, +可以使用更大的内存,但是 64 位的指针意味着更大的浪费,因为你的指针本身大了。更糟糕的是, +更大的指针在主内存和各级缓存(例如 LLC,L1 等)之间移动数据的时候,会占用更多的带宽。 + +Java 使用一个叫作 https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[内存指针压缩(compressed oops)]((("compressed object pointers")))的技术来解决这个问题。 +它的指针不再表示对象在内存中的精确位置,而是表示 _偏移量_ 。((("object offsets")))这意味着 32 位的指针可以引用 40 亿个 _对象_ , +而不是 40 亿个字节。最终, +也就是说堆内存增长到 32 GB 的物理内存,也可以用 32 位的指针表示。 + +一旦你越过那个神奇的 ~32 GB 的边界,指针就会切回普通对象的指针。 +每个对象的指针都变长了,就会使用更多的 CPU 内存带宽,也就是说你实际上失去了更多的内存。事实上,当内存到达 +40–50 GB 的时候,有效内存才相当于使用内存对象指针压缩技术时候的 32 GB 内存。 + +这段描述的意思就是说:即便你有足够的内存,也尽量不要 +超过 32 GB。因为它浪费了内存,降低了 CPU 的性能,还要让 GC 应对大内存。 + +==== 到底需要低于 32 GB多少,来设置我的 JVM? + +遗憾的是,这需要看情况。确切的划分要根据 JVMs 和操作系统而定。 +如果你想保证其安全可靠,设置堆内存为 `31 GB` 是一个安全的选择。 +另外,你可以在你的 JVM 设置里添加 `-XX:+PrintFlagsFinal` 用来验证 `JVM` 的临界值, +并且检查 UseCompressedOops 的值是否为 true。对于你自己使用的 JVM 和操作系统,这将找到最合适的堆内存临界值。 + +例如,我们在一台安装 Java 1.7 的 MacOSX 上测试,可以看到指针压缩在被禁用之前,最大堆内存大约是在 32600 mb(~31.83 gb): [source,bash] ---- @@ -105,8 +80,7 @@ $ JAVA_HOME=`/usr/libexec/java_home -v 1.7` java -Xmx32766m -XX:+PrintFlagsFinal bool UseCompressedOops = false ---- -In contrast, a Java 1.8 installation on the same machine has a max heap size -around 32766mb (~31.99gb): +相比之下,同一台机器安装 Java 1.8,可以看到指针压缩在被禁用之前,最大堆内存大约是在 32766 mb(~31.99 gb): [source,bash] ---- @@ -116,86 +90,75 @@ $ JAVA_HOME=`/usr/libexec/java_home -v 1.8` java -Xmx32767m -XX:+PrintFlagsFinal bool UseCompressedOops = false ---- -The morale of the story is that the exact cutoff to leverage compressed oops -varies from JVM to JVM, so take caution when taking examples from elsewhere and -be sure to check your system with your configuration and JVM. +这个例子告诉我们,影响内存指针压缩使用的临界值, +是会根据 JVM 的不同而变化的。 +所以从其他地方获取的例子,需要谨慎使用,要确认检查操作系统配置和 JVM。 -Beginning with Elasticsearch v2.2.0, the startup log will actually tell you if your -JVM is using compressed OOPs or not. You'll see a log message like: +如果使用的是 Elasticsearch v2.2.0,启动日志其实会告诉你 JVM 是否正在使用内存指针压缩。 +你会看到像这样的日志消息: [source, bash] ---- [2015-12-16 13:53:33,417][INFO ][env] [Illyana Rasputin] heap size [989.8mb], compressed ordinary object pointers [true] ---- -Which indicates that compressed object pointers are being used. If they are not, -the message will say `[false]`. - +这表明内存指针压缩正在被使用。如果没有,日志消息会显示 `[false]` 。 [role="pagebreak-before"] -.I Have a Machine with 1 TB RAM! +.我有一个 1 TB 内存的机器! **** -The 32 GB line is fairly important. So what do you do when your machine has a lot -of memory? It is becoming increasingly common to see super-servers with 512–768 GB -of RAM. +这个 32 GB 的分割线是很重要的。那如果你的机器有很大的内存怎么办呢? +一台有着 512–768 GB内存的服务器愈发常见。 -First, we would recommend avoiding such large machines (see <>). +首先,我们建议避免使用这样的高配机器(参考 <>)。 -But if you already have the machines, you have two practical options: +但是如果你已经有了这样的机器,你有两个可选项: -- Are you doing mostly full-text search? Consider giving just under 32 GB to Elasticsearch -and letting Lucene use the rest of memory via the OS filesystem cache. All that -memory will cache segments and lead to blisteringly fast full-text search. +- 你主要做全文检索吗?考虑给 Elasticsearch 不超过 32 GB 的内存, +让 Lucene 通过操作系统文件缓存来利用余下的内存。那些内存都会用来缓存 segments,带来极速的全文检索。 -- Are you doing a lot of sorting/aggregations? You'll likely want that memory -in the heap then. 
Instead of one node with more than 32 GB of RAM, consider running two or -more nodes on a single machine. Still adhere to the 50% rule, though. So if your -machine has 128 GB of RAM, run two nodes, each with just under 32 GB. This means that less -than 64 GB will be used for heaps, and more than 64 GB will be left over for Lucene. +- 你需要更多的排序和聚合?你可能会更希望那些那些内存用在堆中。 +你可以考虑一台机器上创建两个或者更多 ES 节点,而不要部署一个使用或者超过 32 GB 内存的节点。 +仍然要坚持 50% 原则。假设你有个机器有 128 GB 的内存, +你可以创建两个节点,每个节点内存分配不超过 32 GB。 +也就是说不超过 64 GB 内存给 ES 的堆内存,剩下的超过 64 GB 的内存给 Lucene。 + -If you choose this option, set `cluster.routing.allocation.same_shard.host: true` -in your config. This will prevent a primary and a replica shard from colocating -to the same physical machine (since this would remove the benefits of replica high availability). +如果你选择第二种,你需要配置 `cluster.routing.allocation.same_shard.host: true` 。 +这会防止同一个分片(shard)的主副本存在同一个物理机上(因为如果存在一个机器上,副本的高可用性就没有了)。 **** -==== Swapping Is the Death of Performance +==== Swapping 是性能的坟墓 -It should be obvious,((("heap", "sizing and setting", "swapping, death of performance")))((("memory", "swapping as the death of performance")))((("swapping, the death of performance"))) but it bears spelling out clearly: swapping main memory -to disk will _crush_ server performance. Think about it: an in-memory operation -is one that needs to execute quickly. +这是显而易见的,((("heap", "sizing and setting", "swapping, death of performance")))((("memory", "swapping as the death of performance")))((("swapping, the death of performance")))但是还是有必要说的更清楚一点:内存交换 +到磁盘对服务器性能来说是 _致命_ 的。想想看:一个内存操作必须能够被快速执行。 -If memory swaps to disk, a 100-microsecond operation becomes one that take 10 -milliseconds. Now repeat that increase in latency for all other 10us operations. -It isn't difficult to see why swapping is terrible for performance. +如果内存交换到磁盘上,一个 100 微秒的操作可能变成 10 毫秒。 +再想想那么多 10 微秒的操作时延累加起来。 +不难看出 swapping 对于性能是多么可怕。 -The best thing to do is disable swap completely on your system. This can be done -temporarily: +最好的办法就是在你的操作系统中完全禁用 swap。这样可以暂时禁用: [source,bash] ---- sudo swapoff -a ---- -To disable it permanently, you'll likely need to edit your `/etc/fstab`. Consult -the documentation for your OS. +如果需要永久禁用,你可能需要修改 `/etc/fstab` 文件,这要参考你的操作系统相关文档。 -If disabling swap completely is not an option, you can try to lower `swappiness`. -This value controls how aggressively the OS tries to swap memory. -This prevents swapping under normal circumstances, but still allows the OS to swap -under emergency memory situations. +如果你并不打算完全禁用 swap,也可以选择降低 `swappiness` 的值。 +这个值决定操作系统交换内存的频率。 +这可以预防正常情况下发生交换,但仍允许操作系统在紧急情况下发生交换。 -For most Linux systems, this is configured using the `sysctl` value: +对于大部分Linux操作系统,可以在 `sysctl` 中这样配置: [source,bash] ---- vm.swappiness = 1 <1> ---- -<1> A `swappiness` of `1` is better than `0`, since on some kernel versions a `swappiness` -of `0` can invoke the OOM-killer. +<1> `swappiness` 设置为 `1` 比设置为 `0` 要好,因为在一些内核版本 `swappiness` 设置为 `0` 会触发系统 OOM-killer(注:Linux 内核的 Out of Memory(OOM)killer 机制)。 -Finally, if neither approach is possible, you should enable `mlockall`. - file. This allows the JVM to lock its memory and prevent -it from being swapped by the OS. 
In your `elasticsearch.yml`, set this: +最后,如果上面的方法都不合适,你需要打开配置文件中的 `mlockall` 开关。 +它的作用就是允许 JVM 锁住内存,禁止操作系统交换出去。在你的 `elasticsearch.yml` 文件中,设置如下: [source,yaml] ---- From 2f9ceee2c87918dee6607e3b0fa73ed635b7a774 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 10 Mar 2016 19:38:14 +0800 Subject: [PATCH 21/95] chapter46_part4: /510_Deployment/40_config.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 完成 [[important-configuration-changes]] --- 510_Deployment/40_config.asciidoc | 165 ++++++++++++------------------ 1 file changed, 63 insertions(+), 102 deletions(-) diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index b0f29eeb8..e2b750393 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -1,9 +1,9 @@ -[[重要配置的修改]] +[[important-configuration-changes]] === 重要配置的修改 Elasticsearch 已经有了 _很好_ 的默认值,((("deployment", "configuration changes, important")))((("configuration changes, important")))特别是涉及到性能相关的配置或者选项。 如果你有疑问,最好就不要动它。我们已经目睹了数十个因为错误的设置而导致毁灭的集群, -因为它的管理者总认为改动一个配置或者选项就可以带来100倍的提升。 +因为它的管理者总认为改动一个配置或者选项就可以带来 100 倍的提升。 [NOTE] ==== @@ -21,8 +21,7 @@ Elasticsearch 已经有了 _很好_ 的默认值,((("deployment", "configurati ==== 指定名字 -Elasticsearch 默认启动的集群名字叫 `elasticsearch`。((("configuration changes, important", "assigning names")))你最好 -给你的生产环境的集群改个名字,改名字的目的很简单, +Elasticsearch 默认启动的集群名字叫 `elasticsearch` 。((("configuration changes, important", "assigning names")))你最好给你的生产环境的集群改个名字,改名字的目的很简单, 就是防止某个人的笔记本加入到了集群。简单修改成 `elasticsearch_production` 会很省心。 你可以在你的 `elasticsearch.yml` 文件中: @@ -33,7 +32,7 @@ cluster.name: elasticsearch_production ---- 同样,最好也修改你的节点名字。就像你现在可能发现的那样, -Elasticsearch 会在你的节点启动的时候随机给它指定一个名字。你可能会觉得这很有趣,但是当凌晨3点钟的时候, +Elasticsearch 会在你的节点启动的时候随机给它指定一个名字。你可能会觉得这很有趣,但是当凌晨 3 点钟的时候, 你还在尝试回忆哪台物理机是 `Tagak the Leopard Lord` 的时候,你就不觉得有趣了。 更重要的是,这些名字是在启动的时候产生的,每次启动节点, @@ -72,30 +71,23 @@ path.plugins: /path/to/plugins <1> 注意:你可以通过逗号分隔指定多个目录。 数据可以保存到多个不同的目录, -每个目录如果是挂载在不同的硬盘,做 RAID 0 是一个简单而有效的办法。Elasticsearch 会自动把数据分隔到不同的目录,以便提高性能。 +每个目录如果是挂载在不同的硬盘,做一个磁盘阵列( RAID 0 )是简单而有效的办法。Elasticsearch 会自动把条带化(注:RAID 0 又称为 Stripe(条带化),在磁盘阵列中,数据是以条带的方式贯穿在磁盘阵列所有硬盘中的) +数据分隔到不同的目录,以便提高性能。 -.Multiple data path safety and performance +.多个数据路径的安全性和性能 [WARNING] ==================== -Like any RAID 0 configuration, only a single copy of your data is saved to the -hard drives. If you lose a hard drive, you are _guaranteed_ to lose a portion -of your data on that machine. With luck you'll have replicas elsewhere in the -cluster which can recover the data, and/or a recent <>. - -Elasticsearch attempts to minimize the extent of data loss by striping entire -shards to a drive. That means that `Shard 0` will be placed entirely on a single -drive. Elasticsearch will not stripe a shard across multiple drives, since the -loss of one drive would corrupt the entire shard. - -This has ramifications for performance: if you are adding multiple drives -to improve the performance of a single index, it is unlikely to help since -most nodes will only have one shard, and thus one active drive. Multiple data -paths only helps if you have many indices/shards on a single node. - -Multiple data paths is a nice convenience feature, but at the end of the day, -Elasticsearch is not a software RAID package. If you need more advanced configuration, -robustness and flexibility, we encourage you to use actual software RAID packages -instead of the multiple data path feature. 
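A quick way to double-check that the `cluster.name`, `node.name`, and `path.*` values described above were actually applied is to ask the running cluster rather than re-reading `elasticsearch.yml`. A sketch (the nodes info API accepts the `settings` metric, though the exact layout of the response varies slightly between versions):

[source,js]
--------------------------------------------------
GET / <1>

GET /_nodes/settings <2>
--------------------------------------------------
<1> The root endpoint reports the name of the node that answered and, in recent versions, the cluster name.
<2> Nodes info with the `settings` metric returns, per node, the effective settings, including `path.data` and `path.logs`.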
+如同任何磁盘阵列( RAID 0 )的配置,只有单一的数据拷贝保存到硬盘驱动器。如果你失去了一个硬盘驱动器,你 _肯定_ 会失去该计算机上的一部分数据。 +运气好的话你的副本在集群的其他地方,可以用来恢复数据和最近的备份。 + +Elasticsearch 试图将全部的条带化分片放到单个驱动器来保证最小程度的数据丢失。这意味着 `分片 0` 将完全被放置在单个驱动器上。 +Elasticsearch 没有一个条带化的分片跨越在多个驱动器,因为一个驱动器的损失会破坏整个分片。 + +这对性能产生的影响是:如果您添加多个驱动器来提高一个单独索引的性能,可能帮助不大,因为 +大多数节点只有一个分片和这样一个积极的驱动器。多个数据路径只是帮助如果你有许多索引/分片在单个节点上。 + +多个数据路径是一个非常方便的功能,但到头来,Elasticsearch 并不是软磁盘阵列( software RAID )的包。如果你需要更高级的、稳健的、灵活的配置, +我们建议你使用软磁盘阵列( software RAID )的包,而不是多个数据路径的功能。 ==================== ==== 最小主节点数 @@ -110,13 +102,13 @@ instead of the multiple data path feature. 这个配置就是告诉 Elasticsearch 当没有足够 master 候选节点的时候,就不要进行 master 节点选举,等 master 候选节点足够了才进行选举。 -此设置应该始终被配置为 master 候选节点的法定个数(大多数个)。((("quorum")))法定个数就是 `( master 候选节点个数 / 2) + 1`。 +此设置应该始终被配置为 master 候选节点的法定个数(大多数个)。((("quorum")))法定个数就是 `( master 候选节点个数 / 2) + 1` 。 这里有几个例子: -- 如果你有10个节点(能保存数据,同时能成为 master),法定数就是 `6`。 -- 如果你有3个候选 master 节点,和100个 date 节点,法定数就是 `2`,你只要数数那些可以做 master 的节点数就可以了。 -- 如果你有两个节点,你遇到难题了。法定数当然是 `2`,但是这意味着如果有一个节点挂掉,你整个集群就不可用了。 -设置成 `1` 可以保证集群的功能,但是就无法保证集群脑裂了,像这样的情况,你最好至少保证有3个节点。 +- 如果你有 10 个节点(能保存数据,同时能成为 master),法定数就是 `6` 。 +- 如果你有 3 个候选 master 节点,和 100 个 date 节点,法定数就是 `2` ,你只要数数那些可以做 master 的节点数就可以了。 +- 如果你有两个节点,你遇到难题了。法定数当然是 `2` ,但是这意味着如果有一个节点挂掉,你整个集群就不可用了。 +设置成 `1` 可以保证集群的功能,但是就无法保证集群脑裂了,像这样的情况,你最好至少保证有 3 个节点。 你可以在你的 `elasticsearch.yml` 文件中这样配置: @@ -129,7 +121,7 @@ discovery.zen.minimum_master_nodes: 2 但是这会改变这个法定个数。 你不得不修改每一个索引节点的配置并且重启你的整个集群只是为了让配置生效,这将是非常痛苦的一件事情。 -基于这个原因,`minimum_master_nodes`(还有一些其它配置)允许通过 API 调用的方式动态进行配置。 +基于这个原因, `minimum_master_nodes` (还有一些其它配置)允许通过 API 调用的方式动态进行配置。 当你的集群在线运行的时候,你可以这样修改配置: [source,js] @@ -142,53 +134,37 @@ PUT /_cluster/settings } ---- -这将成为一个永久的配置,并且无论你配置项里配置的如何,这个将优先生效。当你添加和删除master节点的时候,你需要更改这个配置。 +这将成为一个永久的配置,并且无论你配置项里配置的如何,这个将优先生效。当你添加和删除 master 节点的时候,你需要更改这个配置。 ==== 集群恢复方面的配置 当你集群重启时,几个配置项影响你的分片恢复的表现。((("recovery settings")))((("configuration changes, important", "recovery settings")))首先,我们需要明白 如果什么也没配置将会发生什么。 -Imagine you have ten nodes, and each node holds a single shard--either a primary -or a replica--in a 5 primary / 1 replica index. You take your -entire cluster offline for maintenance (installing new drives, for example). When you -restart your cluster, it just so happens that five nodes come online before -the other five. - -Maybe the switch to the other five is being flaky, and they didn't -receive the restart command right away. Whatever the reason, you have five nodes -online. These five nodes will gossip with each other, elect a master, and form a -cluster. They notice that data is no longer evenly distributed, since five -nodes are missing from the cluster, and immediately start replicating new -shards between each other. - -Finally, your other five nodes turn on and join the cluster. These nodes see -that _their_ data is being replicated to other nodes, so they delete their local -data (since it is now redundant, and may be outdated). Then the cluster starts -to rebalance even more, since the cluster size just went from five to ten. - -During this whole process, your nodes are thrashing the disk and network, moving -data around--for no good reason. For large clusters with terabytes of data, -this useless shuffling of data can take a _really long time_. If all the nodes -had simply waited for the cluster to come online, all the data would have been -local and nothing would need to move. - -Now that we know the problem, we can configure a few settings to alleviate it. 
-First, we need to give Elasticsearch a hard limit: +想象一下假设你有 10 个节点,每个节点只保存一个分片,这个分片是一个主分片或者是一个分片副本,或者说有一个有 5 个主分片/1 个分片副本的索引。有时你需要为整个集群做离线维护(比如,为了安装一个新的驱动程序), +当你重启你的集群,恰巧出现了 5 个节点已经启动,还有 5 个还没启动的场景。 + +假设其它 5 个节点出问题,或者他们根本没有收到立即重启的命令。不管什么原因,你有 5 个节点在线上,这五个节点会相互通信,选出一个 master,从而形成一个集群。 +他们注意到数据不再均匀分布,因为有 5 个节点在集群中丢失了,所以他们之间会立马启动分片复制。 + +最后,你的其它 5 个节点打开加入了集群。这些节点会发现 _它们_ 的数据正在被复制到其他节点,(因为这份数据要么是多余的,要么是过时的)。 +然后整个集群重新进行平衡,因为集群的大小已经从 5 变成了 10。 + +在整个过程中,你的节点会消耗磁盘和网盘,来回移动数据,因为没有更好的办法。对于有 TB 数据的大集群, +这种无用的数据传输需要 _很长时间_ 。如果等待所有的节点重启好了,整个集群再上线,所有的本地的数据都不需要移动。 + +现在我们知道问题的所在了,我们可以修改一些设置来缓解它。 +首先我们要给 ELasticsearch 一个严格的限制: [source,yaml] ---- gateway.recover_after_nodes: 8 ---- -This will prevent Elasticsearch from starting a recovery until at least eight (data or master) nodes -are present. The value for this setting is a matter of personal preference: how -many nodes do you want present before you consider your cluster functional? -In this case, we are setting it to `8`, which means the cluster is inoperable -unless there are at least eight nodes. +这将防止 Elasticsearch 从一开始就进行数据恢复,在存在 8 个节点(数据节点或者 master 节点)之前。 +这个值的设定取决于个人喜好:整个集群提供服务之前你希望有多少个节点在线?这种情况下,我们设置为 8,这意味着至少要有 8 个节点,该集群才可用。 -Then we tell Elasticsearch how many nodes _should_ be in the cluster, and how -long we want to wait for all those nodes: +现在我们要告诉 Elasticsearch 集群中 _应该_ 有多少个节点,重启这些节点我们希望等待多长时间: [source,yaml] ---- @@ -196,50 +172,35 @@ gateway.expected_nodes: 10 gateway.recover_after_time: 5m ---- -What this means is that Elasticsearch will do the following: +这意味着 Elasticsearch 会采取如下操作: -- Wait for eight nodes to be present -- Begin recovering after 5 minutes _or_ after ten nodes have joined the cluster, -whichever comes first. +- 等待集群至少存在 8 个节点 +- 等待 5 分钟,或者10 个节点上线后,才进行数据恢复,这取决于哪个条件先达到。 -These three settings allow you to avoid the excessive shard swapping that can -occur on cluster restarts. It can literally make recovery take seconds instead -of hours. +这三个设置可以在集群重启的时候避免过多的分片交换。这可能会让数据恢复从数个小时缩短为几秒钟。 -NOTE: These settings can only be set in the `config/elasticsearch.yml` file or on -the command line (they are not dynamically updatable) and they are only relevant -during a full cluster restart. +注意:这些配置只能设置在 `config/elasticsearch.yml` 文件中或者是在命令行里(它们不能动态更新)它们只在整个集群重启的时候有实质性作用。 [[unicast]] -==== Prefer Unicast over Multicast - -Elasticsearch is configured to use unicast discovery out of the box to prevent -nodes from accidentally joining a cluster. Only nodes running on the same -machine will automatically form cluster. - -While multicast is still https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[provided -as a plugin], it should never be used in production. The -last thing you want is for nodes to accidentally join your production network, simply -because they received an errant multicast ping. There is nothing wrong with -multicast _per se_. Multicast simply leads to silly problems, and can be a bit -more fragile (for example, a network engineer fiddles with the network without telling -you--and all of a sudden nodes can't find each other anymore). - -To use unicast, you provide Elasticsearch a list of nodes that it should try to contact. -When a node contacts a member of the unicast list, it receives a full cluster -state that lists all of the nodes in the cluster. It then contacts -the master and joins the cluster. - -This means your unicast list does not need to include all of the nodes in your cluster. -It just needs enough nodes that a new node can find someone to talk to. 
If you -use dedicated masters, just list your three dedicated masters and call it a day. -This setting is configured in `elasticsearch.yml`: + +==== 最好使用单播代替组播 + +Elasticsearch 被配置为使用单播发现,开箱即用,以防止节点无意中加入集群。只有在同一台机器上运行的节点将自动形成集群。 + +虽然组播仍然作为 https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[一个插件]提供给我们使用, +但它应该永远不被使用在生产环境了,否在你得到的结果就是一个节点意外的加入到了你的生产环境,仅仅是因为他们收到了一个错误的组播信号。 +对于组播 _本身_ 并没有错,组播会导致一些愚蠢的问题,并且导致集群变的脆弱(比如,一个网络工程师正在捣鼓网络,而没有告诉你,你会发现所有的节点突然发现不了对方了)。 + +使用单播,你可以为 Elasticsearch 提供一些它应该去尝试连接的节点列表。 +当一个节点联系到单播列表中的成员时,它就会得到整个集群所有节点的状态,然后它会联系 master 节点,并加入集群。 + +这意味着你的单播列表不需要包含你的集群中的所有节点, +它只是需要足够的节点,当一个新节点联系上其中一个并且说上话就可以了。如果你使用 master 候选节点作为单播列表,你只要列出三个就可以了。 +这个配置在 `elasticsearch.yml` 文件中: [source,yaml] ---- discovery.zen.ping.unicast.hosts: ["host1", "host2:port"] ---- -For more information about how Elasticsearch nodes find eachother, see -https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery] -in the Elasticsearch Reference. +关于 Elasticsearch 节点如何找到对方的详细信息,请参阅 https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery] Elasticsearch 参考文献。 From bd75d0bc91e77039e859f383eed8ea46a2788068 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Fri, 11 Mar 2016 11:09:39 +0800 Subject: [PATCH 22/95] chapter46_part4: /510_Deployment/40_config.asciidoc Fix problem --- 510_Deployment/40_config.asciidoc | 32 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index e2b750393..8f3a69ca9 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -11,20 +11,20 @@ Elasticsearch 已经有了 _很好_ 的默认值,((("deployment", "configurati ==== 其它数据库可能需要调优,但总得来说,Elasticsearch 不需要。 -如果你遇到了性能问题,最好的解决方法通常是更好的数据布局或者更多的节点。 +如果你遇到了性能问题,解决方法通常是更好的数据布局或者更多的节点。 在 Elasticsearch 中很少有“神奇的配置项”, -如果存在,我们也已经帮你优化了。 +如果存在,我们也已经帮你优化了! 
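Relating to the unicast discovery and `minimum_master_nodes` settings covered in the patches above, it can help to verify after a restart that the expected number of nodes actually joined and that any dynamic override is in place. A sketch using two standard endpoints:

[source,js]
--------------------------------------------------
GET /_cluster/health <1>

GET /_cluster/settings <2>
--------------------------------------------------
<1> `number_of_nodes` should match what you expect once discovery has finished.
<2> Shows any `persistent` or `transient` overrides, such as `discovery.zen.minimum_master_nodes`.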
-说到这里,有一些保障性的配置需要在生产环境中做修改。 +说到这里,有一些 _逻辑上_ 的配置需要在生产环境中做修改。 这些改动是必须的,因为没有办法设定好的默认值(它取决于你的集群布局)。 ==== 指定名字 Elasticsearch 默认启动的集群名字叫 `elasticsearch` 。((("configuration changes, important", "assigning names")))你最好给你的生产环境的集群改个名字,改名字的目的很简单, -就是防止某个人的笔记本加入到了集群。简单修改成 `elasticsearch_production` 会很省心。 +就是防止某人的笔记本电脑加入了集群这种意外。简单修改成 `elasticsearch_production` 会很省心。 -你可以在你的 `elasticsearch.yml` 文件中: +你可以在你的 `elasticsearch.yml` 文件中修改: [source,yaml] ---- @@ -33,7 +33,7 @@ cluster.name: elasticsearch_production 同样,最好也修改你的节点名字。就像你现在可能发现的那样, Elasticsearch 会在你的节点启动的时候随机给它指定一个名字。你可能会觉得这很有趣,但是当凌晨 3 点钟的时候, -你还在尝试回忆哪台物理机是 `Tagak the Leopard Lord` 的时候,你就不觉得有趣了。 +你还在尝试回忆哪台物理机是 Tagak the Leopard Lord 的时候,你就不觉得有趣了。 更重要的是,这些名字是在启动的时候产生的,每次启动节点, 它都会得到一个新的名字。这会使日志变得很混乱,因为所有节点的名称都是不断变化的。 @@ -71,7 +71,7 @@ path.plugins: /path/to/plugins <1> 注意:你可以通过逗号分隔指定多个目录。 数据可以保存到多个不同的目录, -每个目录如果是挂载在不同的硬盘,做一个磁盘阵列( RAID 0 )是简单而有效的办法。Elasticsearch 会自动把条带化(注:RAID 0 又称为 Stripe(条带化),在磁盘阵列中,数据是以条带的方式贯穿在磁盘阵列所有硬盘中的) +如果将每个目录分别挂载不同的硬盘,这可是一个简单且高效实现一个软磁盘阵列( RAID 0 )的办法。Elasticsearch 会自动把条带化(注:RAID 0 又称为 Stripe(条带化),在磁盘阵列中,数据是以条带的方式贯穿在磁盘阵列所有硬盘中的) 数据分隔到不同的目录,以便提高性能。 .多个数据路径的安全性和性能 @@ -92,12 +92,11 @@ Elasticsearch 没有一个条带化的分片跨越在多个驱动器,因为一 ==== 最小主节点数 -`minimum_master_nodes` 设定对你的集群的稳定 _及其_ 重要。 +`minimum_master_nodes` 设定对你的集群的稳定 _极其_ 重要。 ((("configuration changes, important", "minimum_master_nodes setting")))((("minimum_master_nodes setting"))) -当你的集群中有两个 masters(注:主节点)的时候,这个配置有助于防止集群分裂(注:脑裂)。 +当你的集群中有两个 masters(注:主节点)的时候,这个配置有助于防止 _脑裂_ ,一种两个主节点同时存在于一个集群的现象。 -如果你的集群发生了一个脑裂,那么你的集群就会处在丢失数据的危险中,因为 -节点是被认为是这个集群的最高统治者,它决定了什么时候新的索引可以创建,多少分片要移动等等。如果你有 _两个_ masters 节点, +如果你的集群发生了脑裂,那么你的集群就会处在丢失数据的危险中,因为主节点被认为是这个集群的最高统治者,它决定了什么时候新的索引可以创建,分片是如何移动的等等。如果你有 _两个_ masters 节点, 你的数据的完整性将得不到保证,因为你有两个节点认为他们有集群的控制权。 这个配置就是告诉 Elasticsearch 当没有足够 master 候选节点的时候,就不要进行 master 节点选举,等 master 候选节点足够了才进行选举。 @@ -138,8 +137,7 @@ PUT /_cluster/settings ==== 集群恢复方面的配置 -当你集群重启时,几个配置项影响你的分片恢复的表现。((("recovery settings")))((("configuration changes, important", "recovery settings")))首先,我们需要明白 -如果什么也没配置将会发生什么。 +当你集群重启时,几个配置项影响你的分片恢复的表现。((("recovery settings")))((("configuration changes, important", "recovery settings")))首先,我们需要明白如果什么也没配置将会发生什么。 想象一下假设你有 10 个节点,每个节点只保存一个分片,这个分片是一个主分片或者是一个分片副本,或者说有一个有 5 个主分片/1 个分片副本的索引。有时你需要为整个集群做离线维护(比如,为了安装一个新的驱动程序), 当你重启你的集群,恰巧出现了 5 个节点已经启动,还有 5 个还没启动的场景。 @@ -147,10 +145,10 @@ PUT /_cluster/settings 假设其它 5 个节点出问题,或者他们根本没有收到立即重启的命令。不管什么原因,你有 5 个节点在线上,这五个节点会相互通信,选出一个 master,从而形成一个集群。 他们注意到数据不再均匀分布,因为有 5 个节点在集群中丢失了,所以他们之间会立马启动分片复制。 -最后,你的其它 5 个节点打开加入了集群。这些节点会发现 _它们_ 的数据正在被复制到其他节点,(因为这份数据要么是多余的,要么是过时的)。 +最后,你的其它 5 个节点打开加入了集群。这些节点会发现 _它们_ 的数据正在被复制到其他节点,所以他们删除本地数据(因为这份数据要么是多余的,要么是过时的)。 然后整个集群重新进行平衡,因为集群的大小已经从 5 变成了 10。 -在整个过程中,你的节点会消耗磁盘和网盘,来回移动数据,因为没有更好的办法。对于有 TB 数据的大集群, +在整个过程中,你的节点会消耗磁盘和网络带宽,来回移动数据,因为没有更好的办法。对于有 TB 数据的大集群, 这种无用的数据传输需要 _很长时间_ 。如果等待所有的节点重启好了,整个集群再上线,所有的本地的数据都不需要移动。 现在我们知道问题的所在了,我们可以修改一些设置来缓解它。 @@ -164,7 +162,7 @@ gateway.recover_after_nodes: 8 这将防止 Elasticsearch 从一开始就进行数据恢复,在存在 8 个节点(数据节点或者 master 节点)之前。 这个值的设定取决于个人喜好:整个集群提供服务之前你希望有多少个节点在线?这种情况下,我们设置为 8,这意味着至少要有 8 个节点,该集群才可用。 -现在我们要告诉 Elasticsearch 集群中 _应该_ 有多少个节点,重启这些节点我们希望等待多长时间: +现在我们要告诉 Elasticsearch 集群中 _应该_ 有多少个节点,以及我们愿意为这些节点等待多长时间: [source,yaml] ---- @@ -203,4 +201,4 @@ Elasticsearch 被配置为使用单播发现,开箱即用,以防止节点无 discovery.zen.ping.unicast.hosts: ["host1", "host2:port"] ---- -关于 Elasticsearch 节点如何找到对方的详细信息,请参阅 https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery] Elasticsearch 参考文献。 +关于 Elasticsearch 节点发现的详细信息,请参阅 
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery] Elasticsearch 文献。 From 8c56850a64d4735e04b37479c60bf0dcc020e5ac Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Fri, 11 Mar 2016 11:50:12 +0800 Subject: [PATCH 23/95] chapter46_part4: /510_Deployment/40_config.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 遗漏的问题修改 --- 510_Deployment/40_config.asciidoc | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index 8f3a69ca9..81e1a3f3a 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -15,9 +15,8 @@ Elasticsearch 已经有了 _很好_ 的默认值,((("deployment", "configurati 在 Elasticsearch 中很少有“神奇的配置项”, 如果存在,我们也已经帮你优化了! -说到这里,有一些 _逻辑上_ 的配置需要在生产环境中做修改。 -这些改动是必须的,因为没有办法设定好的默认值(它取决于你的集群布局)。 - +也就是说,有些配置在生成环境中是应该调整的。 +这些变化会让你的生活更轻松,因为没有办法设定好的默认值(它取决于你的集群布局)。 ==== 指定名字 @@ -183,9 +182,9 @@ gateway.recover_after_time: 5m ==== 最好使用单播代替组播 -Elasticsearch 被配置为使用单播发现,开箱即用,以防止节点无意中加入集群。只有在同一台机器上运行的节点将自动形成集群。 +Elasticsearch 默认被配置为使用单播发现,以防止节点无意中加入集群。只有在同一台机器上运行的节点才会自动组成集群。 -虽然组播仍然作为 https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[一个插件]提供给我们使用, +虽然组播仍然 https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[作为插件提供], 但它应该永远不被使用在生产环境了,否在你得到的结果就是一个节点意外的加入到了你的生产环境,仅仅是因为他们收到了一个错误的组播信号。 对于组播 _本身_ 并没有错,组播会导致一些愚蠢的问题,并且导致集群变的脆弱(比如,一个网络工程师正在捣鼓网络,而没有告诉你,你会发现所有的节点突然发现不了对方了)。 From 27bc2c283dd29a0e8d1cef5d267e71e8e8fb3897 Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Fri, 11 Mar 2016 15:21:57 +0800 Subject: [PATCH 24/95] chapter12_part2: /080_Structured_Search/10_compoundfilters.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit chapter12_part2: /080_Structured_Search/10_compoundfilters.asciidoc 第二部分:第12章,结构化搜索,组合过滤器 --- .../10_compoundfilters.asciidoc | 101 +++++++++--------- 1 file changed, 49 insertions(+), 52 deletions(-) diff --git a/080_Structured_Search/10_compoundfilters.asciidoc b/080_Structured_Search/10_compoundfilters.asciidoc index 28fd27b88..6ab7c5c23 100644 --- a/080_Structured_Search/10_compoundfilters.asciidoc +++ b/080_Structured_Search/10_compoundfilters.asciidoc @@ -1,9 +1,9 @@ [[combining-filters]] -=== Combining Filters +=== 组合过滤器 -The previous two examples showed a single filter in use.((("structured search", "combining filters")))((("filters", "combining"))) In practice, you -will probably need to filter on multiple values or fields. For example, how -would you express this SQL in Elasticsearch? +前面的两个例子都是单个过滤器(filter)的使用方式。((("structured search", "combining filters")))((("filters", "combining"))) 在实际应用中,我们 +很有可能会过滤多个值或字段。比方说,怎样 +用 Elasticsearch 来表达下面的 SQL ? [source,sql] -------------------------------------------------- @@ -13,14 +13,14 @@ WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30) -------------------------------------------------- -In these situations, you will need the `bool` filter.((("filters", "combining", "in bool filter")))((("bool filter"))) This is a _compound -filter_ that accepts other filters as arguments, combining them in various -Boolean combinations. 
+这种情况下,我们需要 `bool` (布尔)过滤器。((("filters", "combining", "in bool filter")))((("bool filter"))) 这是个 _复合过滤器(compound filter)_ , +它可以接受多个其他过滤器作为参数,并将这些过滤器结合构成各式各样的 +布尔(逻辑)组合。 [[bool-filter]] -==== Bool Filter +==== 布尔过滤器 -The `bool` filter is composed of three sections: +一个 `bool` 过滤器由三部分组成: [source,js] -------------------------------------------------- @@ -33,28 +33,27 @@ The `bool` filter is composed of three sections: } -------------------------------------------------- - `must`:: - All of these clauses _must_ match. The equivalent of `AND`. - - `must_not`:: - All of these clauses _must not_ match. The equivalent of `NOT`. - - `should`:: - At least one of these clauses must match. The equivalent of `OR`. + `must`:: + 所有的语句都 _必须(must)_ 匹配,与 `AND` 等价。 -And that's it!((("should clause", "in bool filters")))((("must_not clause", "in bool filters")))((("must clause", "in bool filters"))) When you need multiple filters, simply place them into the -different sections of the `bool` filter. + `must_not`:: + 所有的语句都 _不能(must not)_ 匹配,与 `NOT` 等价。 + + `should`:: + 至少有一个语句要匹配,与 `OR` 等价。 + +就这么简单!((("should clause", "in bool filters")))((("must_not clause", "in bool filters")))((("must clause", "in bool filters"))) 当我们需要多个过滤器时,只须将它们置入 + `bool` 过滤器的不同部分即可。 [NOTE] ==== -Each section of the `bool` filter is optional (for example, you can have a `must` -clause and nothing else), and each section can contain a single filter or an -array of filters. +一个 `bool` 过滤器的每个部分都是可选的(例如,我们可以只有一个 `must` 语句), +而且每个部分内部可以只有一个或一组过滤器。 ==== -To replicate the preceding SQL example, we will take the two `term` filters that -we used((("term filter", "placing inside bool filter")))((("bool filter", "with two term filters in should clause and must_not clause"))) previously and place them inside the `should` clause of a `bool` -filter, and add another clause to deal with the `NOT` condition: +用 Elasticsearch 来表示本部分开始处的 SQL 例子,将两个 `term` 过滤器 +置入 `bool` 过滤器的 `should` 语句 +内,再增加一个语句处理 `NOT` (非)的条件: [source,js] -------------------------------------------------- @@ -79,14 +78,14 @@ GET /my_store/products/_search -------------------------------------------------- // SENSE: 080_Structured_Search/10_Bool_filter.json -<1> Note that we still need to use a `filtered` query to wrap everything. -<2> These two `term` filters are _children_ of the `bool` filter, and since they - are placed inside the `should` clause, at least one of them needs to match. -<3> If a product has a price of `30`, it is automatically excluded because it - matches a `must_not` clause. 
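One caveat worth keeping in mind next to this example: the `filtered` query used here is the 1.x syntax and was deprecated in Elasticsearch 2.0 in favour of placing filters in the `filter` clause of a `bool` query. A rough 2.x-style equivalent of the query above, offered only as a sketch (the behaviour should match, but this form is not part of the original text):

[source,js]
--------------------------------------------------
GET /my_store/products/_search
{
   "query" : {
      "bool" : {
         "filter" : {
            "bool" : {
               "should" : [
                  { "term" : {"price" : 20}},
                  { "term" : {"productID" : "XHDK-A-1293-#fJ3"}}
               ],
               "must_not" : {
                  "term" : {"price" : 30}
               }
            }
         }
      }
   }
}
--------------------------------------------------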
+<1> 注意,我们仍然需要一个 `filtered` 查询将所有的东西包起来。 +<2> 在 `should` 语句块里面的两个 `term` 过滤器 + 与 `bool` 过滤器是父子关系,两个 `term` 条件需要匹配其一。 +<3> 如果一个产品的价格是 `30` ,那么它会自动被排除,因为它 + 处于 `must_not` 语句里面。 -Our search results return two hits, each document satisfying a different clause -in the `bool` filter: +我们搜索的结果返回了 2 个命中结果,两个文档分别匹配了 `bool` 过滤器 +其中的一个条件: [source,json] -------------------------------------------------- @@ -109,17 +108,17 @@ in the `bool` filter: } ] -------------------------------------------------- -<1> Matches the `term` filter for `productID = "XHDK-A-1293-#fJ3"` -<2> Matches the `term` filter for `price = 20` +<1> 与 `term` 过滤器中 `productID = "XHDK-A-1293-#fJ3"` 条件匹配 +<2> 与 `term` 过滤器中 `price = 20` 条件匹配 -==== Nesting Boolean Filters +==== 嵌套布尔过滤器 -Even though `bool` is a compound filter and accepts children filters, it is -important to understand that `bool` is just a filter itself.((("filters", "combining", "nesting bool filters")))((("bool filter", "nesting in another bool filter"))) This means you -can nest `bool` filters inside other `bool` filters, giving you the -ability to make arbitrarily complex Boolean logic. +尽管 `bool` 是一个复合的过滤器,可以接受多个子过滤器,需要 +注意的是 `bool` 过滤器本身仍然还只是一个过滤器。((("filters", "combining", "nesting bool filters")))((("bool filter", "nesting in another bool filter"))) 这意味着我们 +可以将一个 `bool` 过滤器置于其他 `bool` 过滤器内部,这为我们 +提供了对任意复杂布尔逻辑进行处理的能力。 -Given this SQL statement: +对于以下这个 SQL 语句: [source,sql] -------------------------------------------------- @@ -130,7 +129,7 @@ WHERE productID = "KDKE-B-9947-#kL5" AND price = 30 ) -------------------------------------------------- -We can translate it into a pair of nested `bool` filters: +我们将其转换成一组嵌套的 `bool` 过滤器: [source,js] -------------------------------------------------- @@ -157,14 +156,12 @@ GET /my_store/products/_search -------------------------------------------------- // SENSE: 080_Structured_Search/10_Bool_filter.json -<1> Because the `term` and the `bool` are sibling clauses inside the first - Boolean `should`, at least one of these filters must match for a document - to be a hit. - -<2> These two `term` clauses are siblings in a `must` clause, so they both - have to match for a document to be returned as a hit. +<1> 因为 `term` 和 `bool` 过滤器是兄弟关系,他们都处于外层的 + 布尔逻辑 `should` 的内部,返回的命中文档至少须匹配其中一个过滤器的条件。 +<2> 这两个 `term` 语句作为兄弟关系,同时处于 `must` 语句之中,所以 + 返回的命中文档要必须都能同时匹配这两个条件。 -The results show us two documents, one matching each of the `should` clauses: +得到的结果有两个文档,它们各匹配 `should` 语句中的一个条件: [source,json] -------------------------------------------------- @@ -187,8 +184,8 @@ The results show us two documents, one matching each of the `should` clauses: } ] -------------------------------------------------- -<1> This `productID` matches the `term` in the first `bool`. -<2> These two fields match the `term` filters in the nested `bool`. +<1> 这个 `productID` 与外层的 `bool` 过滤器 `should` 里的唯一一个 `term` 匹配。 +<2> 这两个字段与嵌套的 `bool` 过滤器 `must` 里的两个 `term` 匹配。 -This was a simple example, but it demonstrates how Boolean filters can be -used as building blocks to construct complex logical conditions. +这只是个简单的例子,但足以展示布尔过滤器可以 +用来作为构造复杂逻辑条件的基本构建模块。 From 10cb12f6dee782a17d8fe02f53ba54977a55631f Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Sat, 12 Mar 2016 14:02:21 +0800 Subject: [PATCH 25/95] chapter12_part3: /080_Structured_Search/10_compoundfilters.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit fix: 1) 换行问题 2) 结合构成 -> 结合成 这样是否跟通顺? 3) Line 19: 此处应该用中文标点 4) Line 48: 括号里的非是否多余? 
--- .../10_compoundfilters.asciidoc | 43 ++++++------------- 1 file changed, 13 insertions(+), 30 deletions(-) diff --git a/080_Structured_Search/10_compoundfilters.asciidoc b/080_Structured_Search/10_compoundfilters.asciidoc index 6ab7c5c23..f11578316 100644 --- a/080_Structured_Search/10_compoundfilters.asciidoc +++ b/080_Structured_Search/10_compoundfilters.asciidoc @@ -1,9 +1,7 @@ [[combining-filters]] === 组合过滤器 -前面的两个例子都是单个过滤器(filter)的使用方式。((("structured search", "combining filters")))((("filters", "combining"))) 在实际应用中,我们 -很有可能会过滤多个值或字段。比方说,怎样 -用 Elasticsearch 来表达下面的 SQL ? +前面的两个例子都是单个过滤器(filter)的使用方式。((("structured search", "combining filters")))((("filters", "combining"))) 在实际应用中,我们很有可能会过滤多个值或字段。比方说,怎样用 Elasticsearch 来表达下面的 SQL ? [source,sql] -------------------------------------------------- @@ -13,14 +11,12 @@ WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30) -------------------------------------------------- -这种情况下,我们需要 `bool` (布尔)过滤器。((("filters", "combining", "in bool filter")))((("bool filter"))) 这是个 _复合过滤器(compound filter)_ , -它可以接受多个其他过滤器作为参数,并将这些过滤器结合构成各式各样的 -布尔(逻辑)组合。 +这种情况下,我们需要 `bool` (布尔)过滤器。((("filters", "combining", "in bool filter")))((("bool filter"))) 这是个 _复合过滤器(compound filter)_ ,它可以接受多个其他过滤器作为参数,并将这些过滤器结合成各式各样的布尔(逻辑)组合。 [[bool-filter]] ==== 布尔过滤器 -一个 `bool` 过滤器由三部分组成: +一个 `bool` 过滤器由三部分组成: [source,js] -------------------------------------------------- @@ -42,18 +38,14 @@ WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3") `should`:: 至少有一个语句要匹配,与 `OR` 等价。 -就这么简单!((("should clause", "in bool filters")))((("must_not clause", "in bool filters")))((("must clause", "in bool filters"))) 当我们需要多个过滤器时,只须将它们置入 - `bool` 过滤器的不同部分即可。 +就这么简单!((("should clause", "in bool filters")))((("must_not clause", "in bool filters")))((("must clause", "in bool filters"))) 当我们需要多个过滤器时,只须将它们置入 `bool` 过滤器的不同部分即可。 [NOTE] ==== -一个 `bool` 过滤器的每个部分都是可选的(例如,我们可以只有一个 `must` 语句), -而且每个部分内部可以只有一个或一组过滤器。 +一个 `bool` 过滤器的每个部分都是可选的(例如,我们可以只有一个 `must` 语句),而且每个部分内部可以只有一个或一组过滤器。 ==== -用 Elasticsearch 来表示本部分开始处的 SQL 例子,将两个 `term` 过滤器 -置入 `bool` 过滤器的 `should` 语句 -内,再增加一个语句处理 `NOT` (非)的条件: +用 Elasticsearch 来表示本部分开始处的 SQL 例子,将两个 `term` 过滤器置入 `bool` 过滤器的 `should` 语句内,再增加一个语句处理 `NOT` 非的条件: [source,js] -------------------------------------------------- @@ -79,13 +71,10 @@ GET /my_store/products/_search // SENSE: 080_Structured_Search/10_Bool_filter.json <1> 注意,我们仍然需要一个 `filtered` 查询将所有的东西包起来。 -<2> 在 `should` 语句块里面的两个 `term` 过滤器 - 与 `bool` 过滤器是父子关系,两个 `term` 条件需要匹配其一。 -<3> 如果一个产品的价格是 `30` ,那么它会自动被排除,因为它 - 处于 `must_not` 语句里面。 +<2> 在 `should` 语句块里面的两个 `term` 过滤器与 `bool` 过滤器是父子关系,两个 `term` 条件需要匹配其一。 +<3> 如果一个产品的价格是 `30` ,那么它会自动被排除,因为它处于 `must_not` 语句里面。 -我们搜索的结果返回了 2 个命中结果,两个文档分别匹配了 `bool` 过滤器 -其中的一个条件: +我们搜索的结果返回了 2 个命中结果,两个文档分别匹配了 `bool` 过滤器其中的一个条件: [source,json] -------------------------------------------------- @@ -113,10 +102,7 @@ GET /my_store/products/_search ==== 嵌套布尔过滤器 -尽管 `bool` 是一个复合的过滤器,可以接受多个子过滤器,需要 -注意的是 `bool` 过滤器本身仍然还只是一个过滤器。((("filters", "combining", "nesting bool filters")))((("bool filter", "nesting in another bool filter"))) 这意味着我们 -可以将一个 `bool` 过滤器置于其他 `bool` 过滤器内部,这为我们 -提供了对任意复杂布尔逻辑进行处理的能力。 +尽管 `bool` 是一个复合的过滤器,可以接受多个子过滤器,需要注意的是 `bool` 过滤器本身仍然还只是一个过滤器。((("filters", "combining", "nesting bool filters")))((("bool filter", "nesting in another bool filter"))) 这意味着我们可以将一个 `bool` 过滤器置于其他 `bool` 过滤器内部,这为我们提供了对任意复杂布尔逻辑进行处理的能力。 对于以下这个 SQL 语句: @@ -156,10 +142,8 @@ GET /my_store/products/_search -------------------------------------------------- // SENSE: 
080_Structured_Search/10_Bool_filter.json -<1> 因为 `term` 和 `bool` 过滤器是兄弟关系,他们都处于外层的 - 布尔逻辑 `should` 的内部,返回的命中文档至少须匹配其中一个过滤器的条件。 -<2> 这两个 `term` 语句作为兄弟关系,同时处于 `must` 语句之中,所以 - 返回的命中文档要必须都能同时匹配这两个条件。 +<1> 因为 `term` 和 `bool` 过滤器是兄弟关系,他们都处于外层的布尔逻辑 `should` 的内部,返回的命中文档至少须匹配其中一个过滤器的条件。 +<2> 这两个 `term` 语句作为兄弟关系,同时处于 `must` 语句之中,所以返回的命中文档要必须都能同时匹配这两个条件。 得到的结果有两个文档,它们各匹配 `should` 语句中的一个条件: @@ -187,5 +171,4 @@ GET /my_store/products/_search <1> 这个 `productID` 与外层的 `bool` 过滤器 `should` 里的唯一一个 `term` 匹配。 <2> 这两个字段与嵌套的 `bool` 过滤器 `must` 里的两个 `term` 匹配。 -这只是个简单的例子,但足以展示布尔过滤器可以 -用来作为构造复杂逻辑条件的基本构建模块。 +这只是个简单的例子,但足以展示布尔过滤器可以用来作为构造复杂逻辑条件的基本构建模块。 From 935de69e36293547ecf21d9aeba448f084d5682d Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Mon, 14 Mar 2016 09:44:31 +0800 Subject: [PATCH 26/95] chapter46_part4: /510_Deployment/40_config.asciidoc Fix problem --- 510_Deployment/40_config.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index 81e1a3f3a..2cecbdb34 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -85,8 +85,8 @@ Elasticsearch 没有一个条带化的分片跨越在多个驱动器,因为一 这对性能产生的影响是:如果您添加多个驱动器来提高一个单独索引的性能,可能帮助不大,因为 大多数节点只有一个分片和这样一个积极的驱动器。多个数据路径只是帮助如果你有许多索引/分片在单个节点上。 -多个数据路径是一个非常方便的功能,但到头来,Elasticsearch 并不是软磁盘阵列( software RAID )的包。如果你需要更高级的、稳健的、灵活的配置, -我们建议你使用软磁盘阵列( software RAID )的包,而不是多个数据路径的功能。 +多个数据路径是一个非常方便的功能,但到头来,Elasticsearch 并不是软磁盘阵列( software RAID )的软件。如果你需要更高级的、稳健的、灵活的配置, +我们建议你使用软磁盘阵列( software RAID )的软件,而不是多个数据路径的功能。 ==================== ==== 最小主节点数 From ca9acfd4b3cf16f7a9073da93554c18d4225f540 Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Mon, 14 Mar 2016 17:00:30 +0800 Subject: [PATCH 27/95] chapter13_part1: /100_Full_Text_Search/10_Multi_word_queries.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 第二部分:第13章,全文搜索,多词查询 --- .../10_Multi_word_queries.asciidoc | 68 ++++++------------- 1 file changed, 20 insertions(+), 48 deletions(-) diff --git a/100_Full_Text_Search/10_Multi_word_queries.asciidoc b/100_Full_Text_Search/10_Multi_word_queries.asciidoc index dcb5fa7d9..8ee757037 100644 --- a/100_Full_Text_Search/10_Multi_word_queries.asciidoc +++ b/100_Full_Text_Search/10_Multi_word_queries.asciidoc @@ -1,9 +1,7 @@ [[match-multi-word]] -=== Multiword Queries +=== 多词查询 -If we could search for only one word at a time, full-text search would be -pretty inflexible. Fortunately, the `match` query((("full text search", "multi-word queries")))((("match query", "multi-word query"))) makes multiword queries -just as simple: +如果我们一次只能搜索一个词,那么全文搜索就会不太灵活,幸运的是 `match` 查询让多词查询变得简单:((("full text search", "multi-word queries")))((("match query", "multi-word query"))) [source,js] -------------------------------------------------- @@ -18,7 +16,7 @@ GET /my_index/my_type/_search -------------------------------------------------- // SENSE: 100_Full_Text_Search/05_Match_query.json -The preceding query returns all four documents in the results list: +上面这个查询返回所有四个文档: [source,js] -------------------------------------------------- @@ -56,33 +54,22 @@ The preceding query returns all four documents in the results list: } -------------------------------------------------- -<1> Document 4 is the most relevant because it contains `"brown"` twice and `"dog"` - once. 
+<1> 文档 4 最相关,因为它包含词 `"brown"` 两次以及 `"dog"` 一次。 -<2> Documents 2 and 3 both contain `brown` and `dog` once each, and the `title` - field is the same length in both docs, so they have the same score. +<2> 文档 2、3 同时包含 `brown` 和 `dog` 各一次,而且它们 `title` 字段的长度相同,所以具有相同的评分。 -<3> Document 1 matches even though it contains only `brown`, not `dog`. +<3> 文档 1 也能匹配,尽管它只有 `brown` 没有 `dog` 。 -Because the `match` query has to look for two terms—`["brown","dog"]`—internally it has to execute two `term` queries and combine their individual -results into the overall result. To do this, it wraps the two `term` queries -in a `bool` query, which we examine in detail in <>. +因为 `match` 查询必须查找两个词( `["brown","dog"]` ),它在内部实际上先执行两次 `term` 查询,然后将两次查询的结果合并作为最终结果输出。为了做到这点,它将两个 `term` 查询包入一个 `bool` 查询中,详细信息见 <>。 -The important thing to take away from this is that any document whose -`title` field contains _at least one of the specified terms_ will match the -query. The more terms that match, the more relevant the document. +以上示例告诉我们一个重要信息:即任何文档只要 `title` 字段里包含 _指定词项中的至少一个词_ 就能匹配,被匹配的词项越多,文档就越相关。 [[match-improving-precision]] -==== Improving Precision +==== 提高精度 -Matching any document that contains _any_ of the query terms may result in a -long tail of seemingly irrelevant results. ((("full text search", "multi-word queries", "improving precision")))((("precision", "improving for full text search multi-word queries"))) It's a shotgun approach to search. -Perhaps we want to show only documents that contain _all_ of the query terms. -In other words, instead of `brown OR dog`, we want to return only documents -that match `brown AND dog`. +用 _任意_ 查询词项匹配文档可能会导致结果中出现不相关的长尾。((("full text search", "multi-word queries", "improving precision")))((("precision", "improving for full text search multi-word queries")))这是种散弹式搜索。可能我们只想搜索包含 _所有_ 词项的文档,也就是说,不去匹配 `brown OR dog` ,而通过匹配 `brown AND dog` 找到所有文档。 -The `match` query accepts an `operator` parameter((("match query", "operator parameter")))((("or operator", "in match queries")))((("and operator", "in match queries"))) that defaults to `or`. -You can change it to `and` to require that all specified terms must match: +`match` 查询还可以接受 `operator` 操作符作为输入参数,默认情况下该操作符是 `or` 。我们可以将它修改成 `and` 让所有指定词项都必须匹配: [source,js] -------------------------------------------------- @@ -100,27 +87,18 @@ GET /my_index/my_type/_search -------------------------------------------------- // SENSE: 100_Full_Text_Search/05_Match_query.json -<1> The structure of the `match` query has to change slightly in order to - accommodate the `operator` parameter. +<1> `match` 查询的结构需要做稍许调整才能使用 `operator` 操作符参数。 -This query would exclude document 1, which contains only one of the two terms. +这个查询可以把文档 1 排除在外,因为它只包含两个词项中的一个。 [[match-precision]] -==== Controlling Precision +==== 控制精度 -The choice between _all_ and _any_ is a bit((("full text search", "multi-word queries", "controlling precision"))) too black-or-white. What if the -user specified five query terms, and a document contains only four of them? -Setting `operator` to `and` would exclude this document. +在 _所有_ 与 _任意_ 间二选一有点过于非黑即白。((("full text search", "multi-word queries", "controlling precision")))如果用户给定 5 个查询词项,想查找只包含其中 4 个的文档,该如何处理?将 `operator` 操作符参数设置成 `and` 只会将此文档排除。 -Sometimes that is exactly what you want, but for most full-text search use -cases, you want to include documents that may be relevant but exclude those -that are unlikely to be relevant. In other words, we need something -in-between. 
+有时候这正是我们期望的,但在全文搜索的大多数应用场景下,我们既想包含那些可能相关的文档,同时又排除那些不太相关的。换句话说,我们想要处于中间某种结果。 -The `match` query supports((("match query", "minimum_should_match parameter")))((("minimum_should_match parameter"))) the `minimum_should_match` parameter, which allows -you to specify the number of terms that must match for a document to be considered -relevant. While you can specify an absolute number of terms, it usually makes -sense to specify a percentage instead, as you have no control over the number of words the user may enter: +`match` 查询支持 `minimum_should_match` 最小匹配参数,((("match query", "minimum_should_match parameter")))((("minimum_should_match parameter")))这让我们可以指定必须匹配的词项数用来表示一个文档是否相关。我们可以将其设置为某个具体数字,更常用的做法是将其设置为一个百分数,因为我们无法控制用户搜索时输入的单词数量: [source,js] -------------------------------------------------- @@ -138,18 +116,12 @@ GET /my_index/my_type/_search -------------------------------------------------- // SENSE: 100_Full_Text_Search/05_Match_query.json -When specified as a percentage, `minimum_should_match` does the right thing: -in the preceding example with three terms, `75%` would be rounded down to `66.6%`, -or two out of the three terms. No matter what you set it to, at least one term -must match for a document to be considered a match. +当给定百分比的时候, `minimum_should_match` 会做合适的事情:在之前三词项的示例中, `75%` 会自动被截断成 `66.6%` ,即三个里面两个词。无论这个值设置成什么,至少包含一个词项的文档才会被认为是匹配的。 [NOTE] ==== -The `minimum_should_match` parameter is flexible, and different rules can -be applied depending on the number of terms the user enters. For the full -documentation see the +参数 `minimum_should_match` 的设置非常灵活,可以根据用户输入词项的数目应用不同的规则。完整的信息参考文档 {ref}/query-dsl-minimum-should-match.html#query-dsl-minimum-should-match ==== -To fully understand how the `match` query handles multiword queries, we need -to look at how to combine multiple queries with the `bool` query. +为了完全理解 `match` 是如何处理多词查询的,我们就需要查看如何使用 `bool` 查询将多个查询条件组合在一起。 From 26de281d62c2881b31c9e5ea8c7019ca1c10b4eb Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Mon, 14 Mar 2016 18:24:44 +0800 Subject: [PATCH 28/95] chapter46_part2: /510_Deployment/45_dont_touch.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 不要触碰这些配置! --- 510_Deployment/45_dont_touch.asciidoc | 95 +++++++++------------------ 1 file changed, 31 insertions(+), 64 deletions(-) diff --git a/510_Deployment/45_dont_touch.asciidoc b/510_Deployment/45_dont_touch.asciidoc index 37506390f..c25cc5187 100644 --- a/510_Deployment/45_dont_touch.asciidoc +++ b/510_Deployment/45_dont_touch.asciidoc @@ -1,84 +1,51 @@ -=== Don't Touch These Settings! +=== 不要触碰这些配置! -There are a few hotspots in Elasticsearch that people just can't seem to avoid -tweaking. ((("deployment", "settings to leave unaltered"))) We understand: knobs just beg to be turned. But of all the knobs to turn, these you should _really_ leave alone. They are -often abused and will contribute to terrible stability or terrible performance. -Or both. +在 Elasticsearch 中有一些热点,人们可能不可避免的会碰到。我们理解的,所有的调整就是为了优化,但是这些调整,你真的不需要理会它。因为它们经常会被乱用,从而造成系统的不稳定或者糟糕的性能,甚至两者都有可能。 -==== Garbage Collector +==== 垃圾回收器 -As briefly introduced in <>, the JVM uses a garbage -collector to free unused memory.((("garbage collector"))) This tip is really an extension of the last tip, -but deserves its own section for emphasis: +这里已经简要介绍了 <>,JVM 使用一个垃圾回收器来释放不再使用的内存。((("garbage collector"))) 这篇内容的确是上一篇的一个延续, +但是因为重要,所以值得单独拿出来作为一节。 -Do not change the default garbage collector! +不要更改默认的垃圾回收器! 
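If you want to see which collectors a node is actually running before deciding whether anything needs changing, the nodes info API exposes them. A sketch (the field name `gc_collectors` is recalled from the 1.x/2.x response and may differ slightly by version):

[source,js]
--------------------------------------------------
GET /_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc_collectors
--------------------------------------------------

On a default installation this typically reports the ParNew and ConcurrentMarkSweep (CMS) collectors discussed below.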
-The default GC for Elasticsearch is Concurrent-Mark and Sweep (CMS).((("Concurrent-Mark and Sweep (CMS) garbage collector"))) This GC -runs concurrently with the execution of the application so that it can minimize -pauses. It does, however, have two stop-the-world phases. It also has trouble -collecting large heaps. +Elasticsearch 默认的垃圾回收器( GC )是 CMS。((("Concurrent-Mark and Sweep (CMS) garbage collector"))) 这个垃圾回收器可以和应用并行处理,以便它可以最小化停顿。 +然而,它有两个 stop-the-world 阶段,处理大内存也有点吃力。 -Despite these downsides, it is currently the best GC for low-latency server software -like Elasticsearch. The official recommendation is to use CMS. +尽管有这些缺点,它还是目前对于像 Elasticsearch 这样低延迟需求软件的最佳垃圾回收器。官方建议使用 CMS。 -There is a newer GC called the Garbage First GC (G1GC). ((("Garbage First GC (G1GC)"))) This newer GC is designed -to minimize pausing even more than CMS, and operate on large heaps. It works -by dividing the heap into regions and predicting which regions contain the most -reclaimable space. By collecting those regions first (_garbage first_), it can -minimize pauses and operate on very large heaps. +现在有一款新的垃圾回收器,叫 G1 垃圾回收器( G1GC )。((("Garbage First GC (G1GC)"))) 这款新的 GC 被设计,旨在比 CMS 更小的暂停时间,以及对大内存的处理能力。 +它的原理是把内存分成许多区域,并且预测哪些区域最有可能需要回收内存。( _G1GC_ )通过收集这些区域,产生更小的暂停时间,从而能应对更大的内存。 -Sounds great! Unfortunately, G1GC is still new, and fresh bugs are found routinely. -These bugs are usually of the segfault variety, and will cause hard crashes. -The Lucene test suite is brutal on GC algorithms, and it seems that G1GC hasn't -had the kinks worked out yet. +听起来很棒!遗憾的是,G1GC 还是太新了,经常发现新的 bugs。这些错误通常是分段错误的类型,便造成硬盘的崩溃。 +Lucene 的测试套件对 GC 是很严格的,看起来这些缺陷 G1GC 并没有很好地解决。 -We would like to recommend G1GC someday, but for now, it is simply not stable -enough to meet the demands of Elasticsearch and Lucene. +我们很希望在将来某一天推荐使用 G1GC,但是对于现在,它还不能足够稳定的满足 Elasticsearch 和 Lucene 的要求。 -==== Threadpools +==== 线程池 -Everyone _loves_ to tweak threadpools.((("threadpools"))) For whatever reason, it seems people -cannot resist increasing thread counts. Indexing a lot? More threads! Searching -a lot? More threads! Node idling 95% of the time? More threads! +许多人 _喜欢_ 调整线程池。((("threadpools"))) 无论什么原因,人们好像都无法抵挡的想增加线程数。索引太多了?增加线程!搜索太多了?增加线程!节点空闲率低于 95%?增加线程! -The default threadpool settings in Elasticsearch are very sensible. For all -threadpools (except `search`) the threadcount is set to the number of CPU cores. -If you have eight cores, you can be running only eight threads simultaneously. It makes -sense to assign only eight threads to any particular threadpool. +Elasticsearch 默认的线程设置已经是很合理的了。对于所有的线程池(除了 `搜索` ),线程个数是根据 CPU 核心数设置的。 +如果你有 8 个核,你可以同时运行的只有 8 个线程,只分配 8 个线程给任何特定的线程池是有道理的。 -Search gets a larger threadpool, and is configured to `int((# of cores * 3) / 2) + 1`. - -You might argue that some threads can block (such as on a disk I/O operation), -which is why you need more threads. This is not a problem in Elasticsearch: -much of the disk I/O is handled by threads managed by Lucene, not Elasticsearch. - -Furthermore, threadpools cooperate by passing work between each other. You don't -need to worry about a networking thread blocking because it is waiting on a disk -write. The networking thread will have long since handed off that work unit to -another threadpool and gotten back to networking. - -Finally, the compute capacity of your process is finite. Having more threads just forces -the processor to switch thread contexts. 
A processor can run only one thread -at a time, so when it needs to switch to a different thread, it stores the current -state (registers, and so forth) and loads another thread. If you are lucky, the switch -will happen on the same core. If you are unlucky, the switch may migrate to a -different core and require transport on an inter-core communication bus. - -This context switching eats up cycles simply by doing administrative housekeeping; estimates can peg it as high as 30μs on modern CPUs. So unless the thread -will be blocked for longer than 30μs, it is highly likely that that time would -have been better spent just processing and finishing early. - -People routinely set threadpools to silly values. On eight core machines, we have -run across configs with 60, 100, or even 1000 threads. These settings will simply -thrash the CPU more than getting real work done. - -So. Next time you want to tweak a threadpool, please don't. And if you -_absolutely cannot resist_, please keep your core count in mind and perhaps set -the count to double. More than that is just a waste. +搜索线程池设置的大一点,配置为 `int(( 核心数 * 3 )/ 2 )+ 1` 。 +你可能会认为某些线程可能会阻塞(如磁盘上的 I/O 操作),所以你才想加大线程的。这并不是 Elasticsearch 的一个问题: +因为大多数 I/O 的操作是由 Lucene 线程管理的,而不是 Elasticsearch。 +此外,线程池通过传递彼此之间的工作配合。你不必再因为它正在等待磁盘写操作而担心网络线程阻塞, +因为网络线程早已把这个工作交给另外的线程池,并且网络进行了响应。 +最后,你的处理器的计算容量是有限的,拥有更多的线程会导致你的处理器频繁切换线程上下文。 +一个处理器同时只能运行一个线程。所以当它需要切换到其它不同的线程的时候,它会存储当前的状态(寄存器等等),然后加载另外一个线程。 +如果幸运的话,这个切换发生在同一个核心,如果不幸的话,这个切换可能发生在不同的核心,这就需要在内核间总线上进行传输。 +这个上下文的切换,会循环的带来管理调度开销;在现代的 CPUs 上,开销估计高达 30 μs。也就是说线程会被堵塞超过 30 μs,如果这个时间用于线程的运行,极有可能早就结束了。 +人们经常稀里糊涂的设置线程池的值。8 个核的 CUP,我们遇到过有人配了 60、100 甚至 1000 个线程。 +这些设置只会让 CPU 实际工作效率更低。 +所以,下次请不要调整线程池的线程数。如果你真 _想调整_ , +一定要关注你的 CPU 核心数,最多设置成核心数的两倍,再多了都是浪费。 From c40a6be17676a4e4497e38a40c73c7c0fdcbaea6 Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Mon, 14 Mar 2016 18:45:38 +0800 Subject: [PATCH 29/95] chapter13_part4: /100_Full_Text_Search/chapter13_part1: /100_Full_Text_Search/15_Combining_queries.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 第二部分:第13章,全文搜索,组合查询 --- .../15_Combining_queries.asciidoc | 53 ++++++------------- 1 file changed, 16 insertions(+), 37 deletions(-) diff --git a/100_Full_Text_Search/15_Combining_queries.asciidoc b/100_Full_Text_Search/15_Combining_queries.asciidoc index 20f8b2fc4..8064a109b 100644 --- a/100_Full_Text_Search/15_Combining_queries.asciidoc +++ b/100_Full_Text_Search/15_Combining_queries.asciidoc @@ -1,16 +1,11 @@ [[bool-query]] -=== Combining Queries +=== 组合查询 -In <> we discussed how to((("full text search", "combining queries"))), use the `bool` filter to combine -multiple filter clauses with `and`, `or`, and `not` logic. In query land, the -`bool` query does a similar job but with one important difference. +在 <> 中,我们讨论过如何使用 `bool` 过滤器通过 `and` 、 `or` 和 `not` 逻辑组合将多个过滤器进行组合。在查询中, `bool` 查询有类似的功能,只有一个重要的区别。 -Filters make a binary decision: should this document be included in the -results list or not? Queries, however, are more subtle. They decide not only -whether to include a document, but also how _relevant_ that document is. +过滤器做二元判断:文档是否应该出现在结果中?但查询更精妙,它除了决定一个文档是否应该被包括在结果中,还会计算文档的 _相关程度_ 。 -Like the filter equivalent, the `bool` query accepts((("bool query"))) multiple query clauses -under the `must`, `must_not`, and `should` parameters. 
For instance: +与过滤器一样, `bool` 查询也可以接受 `must` 、 `must_not` 和 `should` 参数下的多个查询语句。((("bool query")))比如: [source,js] -------------------------------------------------- @@ -30,13 +25,9 @@ GET /my_index/my_type/_search -------------------------------------------------- // SENSE: 100_Full_Text_Search/15_Bool_query.json -The results from the preceding query include any document whose `title` field -contains the term `quick`, except for those that also contain `lazy`. So -far, this is pretty similar to how the `bool` filter works. +以上的查询结果返回 `title` 字段包含词项 `quick` 但不包含 `lazy` 的任意文档。目前为止,这与 `bool` 过滤器的工作方式非常相似。 -The difference comes in with the two `should` clauses, which say that: a document -is _not required_ to contain ((("should clause", "in bool queries")))either `brown` or `dog`, but if it does, then -it should be considered _more relevant_: +区别就在于两个 `should` 语句,也就是说:一个文档不必包含((("should clause", "in bool queries"))) `brown` 或 `dog` 这两个词项,但如果一旦包含,我们就认为它们 _更相关_ : [source,js] -------------------------------------------------- @@ -60,28 +51,19 @@ it should be considered _more relevant_: } -------------------------------------------------- -<1> Document 3 scores higher because it contains both `brown` and `dog`. +<1> 文档 3 会比文档 1 有更高评分是因为它同时包含 `brown` 和 `dog` 。 -==== Score Calculation +==== 评分计算 -The `bool` query calculates((("relevance scores", "calculation in bool queries")))((("bool query", "score calculation"))) the relevance `_score` for each document by adding -together the `_score` from all of the matching `must` and `should` clauses, -and then dividing by the total number of `must` and `should` clauses. +`bool` 查询会为每个文档计算相关度评分 `_score` ,((("relevance scores", "calculation in bool queries")))((("bool query", "score calculation")))再将所有匹配的 `must` 和 `should` 语句的分数 `_score` 求和,最后除以 `must` 和 `should` 语句的总数。 -The `must_not` clauses do not affect ((("must_not clause", "in bool queries")))the score; their only purpose is to -exclude documents that might otherwise have been included. +`must_not` 语句不会影响评分;((("must_not clause", "in bool queries")))它的作用只是将不相关的文档排除。 -==== Controlling Precision +==== 控制精度 -All the `must` clauses must match, and all the `must_not` clauses must not -match, but how many `should` clauses((("bool query", "controlling precision")))((("full text search", "combining queries", "controlling precision")))((("precision", "controlling for bool query"))) should match? By default, none of the `should` clauses are required to match, with one -exception: if there are no `must` clauses, then at least one `should` clause -must match. +所有 `must` 语句必须匹配,所有 `must_not` 语句都必须不匹配,但有多少 `should` 语句应该匹配呢?((("bool query", "controlling precision")))((("full text search", "combining queries", "controlling precision")))((("precision", "controlling for bool query")))默认情况下,没有 `should` 语句是必须匹配的,只有一个例外:那就是当没有 `must` 语句的时候,至少有一个 `should` 语句必须匹配。 -Just as we can control the <>, -we can control how many `should` clauses need to match by using the -`minimum_should_match` parameter,((("minimum_should_match parameter", "in bool queries"))) either as an absolute number or as a -percentage: +就像我们能控制 <> 一样,我们可以通过 `minimum_should_match` 参数控制需要匹配的 `should` 语句的数量,((("minimum_should_match parameter", "in bool queries")))它既可以是一个绝对的数字,又可以是个百分比: [source,js] -------------------------------------------------- @@ -101,10 +83,7 @@ GET /my_index/my_type/_search -------------------------------------------------- // SENSE: 100_Full_Text_Search/15_Bool_query.json -<1> This could also be expressed as a percentage. 
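As a small worked illustration of the scoring rule described above (the numbers are invented): with one `must` clause and two `should` clauses, a document that matches the `must` clause with a score of 1.8 and one `should` clause with a score of 1.2 ends up with roughly (1.8 + 1.2) / 3 ≈ 1.0, because the sum of the matching clause scores is divided by the total number of `must` and `should` clauses. A document matching all three clauses divides a larger sum by the same 3, and so scores higher.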
- -The results would include only documents whose `title` field contains `"brown" -AND "fox"`, `"brown" AND "dog"`, or `"fox" AND "dog"`. If a document contains -all three, it would be considered more relevant than those that contain -just two of the three. +<1> 这也可以用百分比表示。 +这个查询结果会将所有满足以下条件的文档返回: `title` 字段包含 `"brown" +AND "fox"` 、 `"brown" AND "dog"` 或 `"fox" AND "dog"` 。如果有文档包含所有三个条件,它会比只包含两个的文档更相关。 From db0895bd588fe4d448ce67a218354f8dd7f4795a Mon Sep 17 00:00:00 2001 From: chenryn Date: Tue, 15 Mar 2016 09:05:53 +0800 Subject: [PATCH 30/95] chapter45_part1:/500_Cluster_Admin/10_intro.asciidoc --- 500_Cluster_Admin/10_intro.asciidoc | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/500_Cluster_Admin/10_intro.asciidoc b/500_Cluster_Admin/10_intro.asciidoc index e9517685d..32729c0c5 100644 --- a/500_Cluster_Admin/10_intro.asciidoc +++ b/500_Cluster_Admin/10_intro.asciidoc @@ -1,15 +1,6 @@ -Elasticsearch is often deployed as a cluster of nodes.((("clusters", "administration"))) A variety of -APIs let you manage and monitor the cluster itself, rather than interact -with the data stored within the cluster. +Elasticsearch 经常以多节点集群的方式部署。((("clusters", "administration")))有多种 API 让你可以管理和监控集群本身,而不用和集群里存储的数据打交道。 -As with most functionality in Elasticsearch, there is an overarching design goal -that tasks should be performed through an API rather than by modifying static -configuration files. This becomes especially important as your cluster scales. -Even with a provisioning system (such as Puppet, Chef, and Ansible), a single HTTP API call -is often simpler than pushing new configurations to hundreds of physical machines. +和 Elasticsearch 里绝大多数功能一样,我们有一个总体的设计目标,即任务应该通过 API 执行,而不是通过修改静态的配置文件。这一点在你的集群扩容时尤为重要。即便通过配置管理系统(比如 Puppet,Chef 或者 Ansible),一个简单的 HTTP API 调用,也比往上百台物理设备上推送新配置文件简单多了。 -To that end, this chapter presents the various APIs that allow you to -dynamically tweak, tune, and configure your cluster. It also covers a -host of APIs that provide statistics about the cluster itself so you can -monitor for health and performance. +因此,本章将介绍各种可以让你动态调整、调优和调配集群的 API。同时,还会介绍一系列提供集群自身统计数据的 API,你可以用这些接口来监控集群健康状态和性能。 From dae095b5eaf437ab33c26f43b61a9dc1e26bd6b9 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Wed, 30 Mar 2016 17:37:32 +0800 Subject: [PATCH 31/95] chapter46_part2: /510_Deployment/45_dont_touch.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 修改一些问题 @biyuhao 🙏 --- 510_Deployment/45_dont_touch.asciidoc | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/510_Deployment/45_dont_touch.asciidoc b/510_Deployment/45_dont_touch.asciidoc index c25cc5187..45721760f 100644 --- a/510_Deployment/45_dont_touch.asciidoc +++ b/510_Deployment/45_dont_touch.asciidoc @@ -1,7 +1,7 @@ === 不要触碰这些配置! 
-在 Elasticsearch 中有一些热点,人们可能不可避免的会碰到。我们理解的,所有的调整就是为了优化,但是这些调整,你真的不需要理会它。因为它们经常会被乱用,从而造成系统的不稳定或者糟糕的性能,甚至两者都有可能。 +在 Elasticsearch 中有一些热点,人们可能不可避免的会碰到。((("deployment", "settings to leave unaltered"))) 我们理解的,所有的调整就是为了优化,但是这些调整,你真的不需要理会它。因为它们经常会被乱用,从而造成系统的不稳定或者糟糕的性能,甚至两者都有可能。 ==== 垃圾回收器 @@ -16,35 +16,34 @@ Elasticsearch 默认的垃圾回收器( GC )是 CMS。((("Concurrent-Mark an 尽管有这些缺点,它还是目前对于像 Elasticsearch 这样低延迟需求软件的最佳垃圾回收器。官方建议使用 CMS。 现在有一款新的垃圾回收器,叫 G1 垃圾回收器( G1GC )。((("Garbage First GC (G1GC)"))) 这款新的 GC 被设计,旨在比 CMS 更小的暂停时间,以及对大内存的处理能力。 -它的原理是把内存分成许多区域,并且预测哪些区域最有可能需要回收内存。( _G1GC_ )通过收集这些区域,产生更小的暂停时间,从而能应对更大的内存。 +它的原理是把内存分成许多区域,并且预测哪些区域最有可能需要回收内存。通过优先收集这些区域( _garbage first_ ),产生更小的暂停时间,从而能应对更大的内存。 -听起来很棒!遗憾的是,G1GC 还是太新了,经常发现新的 bugs。这些错误通常是分段错误的类型,便造成硬盘的崩溃。 -Lucene 的测试套件对 GC 是很严格的,看起来这些缺陷 G1GC 并没有很好地解决。 +听起来很棒!遗憾的是,G1GC 还是太新了,经常发现新的 bugs。这些错误通常是段( segfault )类型,便造成硬盘的崩溃。 +Lucene 的测试套件对垃圾回收算法要求严格,看起来这些缺陷 G1GC 并没有很好地解决。 我们很希望在将来某一天推荐使用 G1GC,但是对于现在,它还不能足够稳定的满足 Elasticsearch 和 Lucene 的要求。 ==== 线程池 -许多人 _喜欢_ 调整线程池。((("threadpools"))) 无论什么原因,人们好像都无法抵挡的想增加线程数。索引太多了?增加线程!搜索太多了?增加线程!节点空闲率低于 95%?增加线程! +许多人 _喜欢_ 调整线程池。((("threadpools"))) 无论什么原因,人们都对增加线程数无法抵抗。索引太多了?增加线程!搜索太多了?增加线程!节点空闲率低于 95%?增加线程! Elasticsearch 默认的线程设置已经是很合理的了。对于所有的线程池(除了 `搜索` ),线程个数是根据 CPU 核心数设置的。 如果你有 8 个核,你可以同时运行的只有 8 个线程,只分配 8 个线程给任何特定的线程池是有道理的。 搜索线程池设置的大一点,配置为 `int(( 核心数 * 3 )/ 2 )+ 1` 。 -你可能会认为某些线程可能会阻塞(如磁盘上的 I/O 操作),所以你才想加大线程的。这并不是 Elasticsearch 的一个问题: -因为大多数 I/O 的操作是由 Lucene 线程管理的,而不是 Elasticsearch。 +你可能会认为某些线程可能会阻塞(如磁盘上的 I/O 操作),所以你才想加大线程的。对于 Elasticsearch 来说这并不是一个问题:因为大多数 I/O 的操作是由 Lucene 线程管理的,而不是 Elasticsearch。 此外,线程池通过传递彼此之间的工作配合。你不必再因为它正在等待磁盘写操作而担心网络线程阻塞, 因为网络线程早已把这个工作交给另外的线程池,并且网络进行了响应。 -最后,你的处理器的计算容量是有限的,拥有更多的线程会导致你的处理器频繁切换线程上下文。 +最后,你的处理器的计算能力是有限的,拥有更多的线程会导致你的处理器频繁切换线程上下文。 一个处理器同时只能运行一个线程。所以当它需要切换到其它不同的线程的时候,它会存储当前的状态(寄存器等等),然后加载另外一个线程。 如果幸运的话,这个切换发生在同一个核心,如果不幸的话,这个切换可能发生在不同的核心,这就需要在内核间总线上进行传输。 -这个上下文的切换,会循环的带来管理调度开销;在现代的 CPUs 上,开销估计高达 30 μs。也就是说线程会被堵塞超过 30 μs,如果这个时间用于线程的运行,极有可能早就结束了。 +这个上下文的切换,会给 CPU 时钟周期带来管理调度的开销;在现代的 CPUs 上,开销估计高达 30 μs。也就是说线程会被堵塞超过 30 μs,如果这个时间用于线程的运行,极有可能早就结束了。 -人们经常稀里糊涂的设置线程池的值。8 个核的 CUP,我们遇到过有人配了 60、100 甚至 1000 个线程。 +人们经常稀里糊涂的设置线程池的值。8 个核的 CPU,我们遇到过有人配了 60、100 甚至 1000 个线程。 这些设置只会让 CPU 实际工作效率更低。 所以,下次请不要调整线程池的线程数。如果你真 _想调整_ , From 46a366b9f9a2c9e2f7b666e81ee74330f7261aef Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Thu, 31 Mar 2016 10:05:46 +0800 Subject: [PATCH 32/95] chapter46_part6: /510_Deployment/50_heap.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 之前的 50_heap.asciidoc 已被 merged ,重提 pr 修改 cluster 的翻译笔误,“群集”修改为“集群” --- 510_Deployment/50_heap.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/510_Deployment/50_heap.asciidoc b/510_Deployment/50_heap.asciidoc index 8e8105bba..dc5073915 100644 --- a/510_Deployment/50_heap.asciidoc +++ b/510_Deployment/50_heap.asciidoc @@ -2,7 +2,7 @@ === 堆内存:大小和交换 Elasticsearch 默认安装后设置的堆内存是 1 GB。((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting")))对于任何一个业务部署来说, -这个设置都太小了。如果你正在使用这些默认堆内存配置,您的群集可能会出现问题。 +这个设置都太小了。如果你正在使用这些默认堆内存配置,您的集群可能会出现问题。 这里有两种方式修改 Elasticsearch 的堆内存。最简单的一个方法就是指定 `ES_HEAP_SIZE` 环境变量。((("ES_HEAP_SIZE environment variable")))服务进程在启动时候会读取这个变量,并相应的设置堆的大小。 比如,你可以用下面的命令设置它: From 6fd900576561435795bb06e7f8efdd368adc5095 Mon Sep 17 00:00:00 2001 From: michealzh Date: Mon, 4 Apr 2016 10:43:10 +0800 Subject: [PATCH 33/95] =?UTF-8?q?020=5FDistributed=5FCluster/00=5FIntro.as?= =?UTF-8?q?ciidoc=20=E7=BF=BB=E8=AF=91?= MIME-Version: 1.0 Content-Type: text/plain; 
charset=UTF-8 Content-Transfer-Encoding: 8bit --- 020_Distributed_Cluster/00_Intro.asciidoc | 39 ++++++++--------------- 1 file changed, 14 insertions(+), 25 deletions(-) diff --git a/020_Distributed_Cluster/00_Intro.asciidoc b/020_Distributed_Cluster/00_Intro.asciidoc index cc998ab38..cf6760716 100644 --- a/020_Distributed_Cluster/00_Intro.asciidoc +++ b/020_Distributed_Cluster/00_Intro.asciidoc @@ -1,36 +1,25 @@ [[distributed-cluster]] -== Life Inside a Cluster +== 集群内的生活 -.Supplemental Chapter +.补充章节 **** -As mentioned earlier, this is the first of several supplemental chapters -about how Elasticsearch operates in a distributed((("clusters"))) environment. In this -chapter, we explain commonly used terminology like _cluster_, _node_, and -_shard_, the mechanics of how Elasticsearch scales out, and how it deals with -hardware failure. +正如前面提到的,这是第一个关于Elasticsearch在分布式((("集群")))环境中是如何运作的几个补充章节。 +在本章中,我们将介绍常用的术语,如 _集群_,_节点_ 和 _碎片_ ,Elasticsearch如何横向扩展的机制,以及它如何处理硬件故障。 -Although this chapter is not required reading--you can use Elasticsearch for -a long time without worrying about shards, replication, and failover--it will -help you to understand the processes at work inside Elasticsearch. Feel free -to skim through the chapter and to refer to it again later. +尽管本章不是必读的-您可以使用Elasticsearch很长一段时间,而不用担心碎片,复制和故障转移-它将帮助您了解Elasticsearch内部工作的流程。 +浏览本章稍后再次参阅。 **** -Elasticsearch is built to be ((("scalability, Elasticsearch and")))always available, and to scale with your needs. -Scale can come from buying bigger ((("vertical scaling, Elasticsearch and")))servers (_vertical scale_, or _scaling up_) -or from buying more ((("horizontal scaling, Elasticsearch and")))servers (_horizontal scale_, or _scaling out_). +Elasticsearch构建为((("scalability, Elasticsearch and")))始终可用并可根据您的需求扩展。 +扩展可以购买更大的((("vertical scaling, Elasticsearch and")))服务器 (_垂直扩展_, 或 _纵向扩展_) +或者购买更多的((("horizontal scaling, Elasticsearch and")))服务器 (_水平扩展_, 或 _横向扩展_). -While Elasticsearch can benefit from more-powerful hardware, vertical scale -has its limits. Real scalability comes from horizontal scale--the ability to -add more nodes to the cluster and to spread load and reliability between them. +Elasticsearch可以受益于更强大的硬件,但垂直扩展有其局限性。 +真正的可伸缩性来自水平横向扩容-能够添加更多的节点到集群来分散它们之间负载和可用性的能力。 -With most databases, scaling horizontally usually requires a major overhaul of -your application to take advantage of these extra boxes. In contrast, -Elasticsearch is _distributed_ by nature: it knows how to manage multiple -nodes to provide scale and high availability. This also means that your -application doesn't need to care about it. +对于大多数数据库,水平扩展通常需要您的应用程序进行额外的修改才能充分利用这些额外的数据库。 +相比之下,Elasticsearch 原生支持 _分布式_ :它知道如何管理多个节点来提供可伸缩和高可用性。这意味着你的应用程序不需要去关心它。 -In this chapter, we show how you can set up your cluster, -nodes, and shards to scale with your needs and to ensure that your data is -safe from hardware failure. 
+在本章中,我们向您展示如何设置集群,节点和分片来按需扩容并保证硬件故障中的数据安全。 From e4e86b9f58fa8d6fe2dbd2d1e4f534561378ea24 Mon Sep 17 00:00:00 2001 From: pengqiuyuan Date: Tue, 12 Apr 2016 11:21:57 +0800 Subject: [PATCH 34/95] chapter46_part4: /510_Deployment/40_config.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 语句调整。 @medcl @biyuhao 🙏 --- 510_Deployment/40_config.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index 2cecbdb34..5d3183c0f 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -15,8 +15,8 @@ Elasticsearch 已经有了 _很好_ 的默认值,((("deployment", "configurati 在 Elasticsearch 中很少有“神奇的配置项”, 如果存在,我们也已经帮你优化了! -也就是说,有些配置在生成环境中是应该调整的。 -这些变化会让你的生活更轻松,因为没有办法设定好的默认值(它取决于你的集群布局)。 +另外,有些 _逻辑上的_ 配置在生产环境中是应该调整的。 +这些调整可能会让你的工作更加轻松,又或者因为没办法设定一个默认值(它取决于你的集群布局)。 ==== 指定名字 @@ -138,11 +138,11 @@ PUT /_cluster/settings 当你集群重启时,几个配置项影响你的分片恢复的表现。((("recovery settings")))((("configuration changes, important", "recovery settings")))首先,我们需要明白如果什么也没配置将会发生什么。 -想象一下假设你有 10 个节点,每个节点只保存一个分片,这个分片是一个主分片或者是一个分片副本,或者说有一个有 5 个主分片/1 个分片副本的索引。有时你需要为整个集群做离线维护(比如,为了安装一个新的驱动程序), +想象一下假设你有 10 个节点,每个节点只保存一个分片,这个分片是一个主分片或者是一个副本分片,或者说有一个有 5 个主分片/1 个副本分片的索引。有时你需要为整个集群做离线维护(比如,为了安装一个新的驱动程序), 当你重启你的集群,恰巧出现了 5 个节点已经启动,还有 5 个还没启动的场景。 假设其它 5 个节点出问题,或者他们根本没有收到立即重启的命令。不管什么原因,你有 5 个节点在线上,这五个节点会相互通信,选出一个 master,从而形成一个集群。 -他们注意到数据不再均匀分布,因为有 5 个节点在集群中丢失了,所以他们之间会立马启动分片复制。 +他们注意到数据不再均匀分布,因为有 5 个节点在集群中丢失了,所以他们之间会立即启动分片复制。 最后,你的其它 5 个节点打开加入了集群。这些节点会发现 _它们_ 的数据正在被复制到其他节点,所以他们删除本地数据(因为这份数据要么是多余的,要么是过时的)。 然后整个集群重新进行平衡,因为集群的大小已经从 5 变成了 10。 From c9eed3779d2bbaefd248c568ede479ee69b2dc87 Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Thu, 14 Apr 2016 07:38:59 +0800 Subject: [PATCH 35/95] chapter14_part1: /110_Multi_Field_Search/00_Intro.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 初译 --- 110_Multi_Field_Search/00_Intro.asciidoc | 17 ++++------------- 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/110_Multi_Field_Search/00_Intro.asciidoc b/110_Multi_Field_Search/00_Intro.asciidoc index d6090dd5e..eced1d324 100644 --- a/110_Multi_Field_Search/00_Intro.asciidoc +++ b/110_Multi_Field_Search/00_Intro.asciidoc @@ -1,17 +1,8 @@ [[multi-field-search]] -== Multifield Search +== 多字段搜索 -Queries are seldom simple one-clause `match` queries. ((("multifield search"))) We frequently need to -search for the same or different query strings in one or more fields, which -means that we need to be able to combine multiple query clauses and their -relevance scores in a way that makes sense. +查询很少是简单一句话的 `match` 匹配查询。((("multifield search")))通常我们需要用相同或不同的字符串查询一个或多个字段,也就是说,需要对多个查询语句以及它们相关度评分进行合理的合并。 -Perhaps we're looking for a book called _War and Peace_ by an author called -Leo Tolstoy. Perhaps we're searching the Elasticsearch documentation -for ``minimum should match,'' which might be in the title or the body of a -page. Or perhaps we're searching for users with first name John and last -name Smith. +有时候或许我们正查找作者 Leo Tolstoy 写的一本名为 _War and Peace_(战争与和平)的书。或许我们正用 "minimum should match" (最少应该匹配)的方式在文档中对标题或页面内容进行搜索,或许我们正在搜索所有名字为 John Smith 的用户。 -In this chapter, we present the available tools for constructing multiclause -searches and how to figure out which solution you should apply to your -particular use case. 
+在本章,我们会介绍构造多语句搜索的工具及在特定场景下应该采用的解决方案。 From bf6351fec740475e4ab9c7abb9a11a2c77b8866f Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Thu, 14 Apr 2016 08:58:43 +0800 Subject: [PATCH 36/95] fix English quota fix English quota --- 110_Multi_Field_Search/00_Intro.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/110_Multi_Field_Search/00_Intro.asciidoc b/110_Multi_Field_Search/00_Intro.asciidoc index eced1d324..a0ea2ea0c 100644 --- a/110_Multi_Field_Search/00_Intro.asciidoc +++ b/110_Multi_Field_Search/00_Intro.asciidoc @@ -3,6 +3,6 @@ 查询很少是简单一句话的 `match` 匹配查询。((("multifield search")))通常我们需要用相同或不同的字符串查询一个或多个字段,也就是说,需要对多个查询语句以及它们相关度评分进行合理的合并。 -有时候或许我们正查找作者 Leo Tolstoy 写的一本名为 _War and Peace_(战争与和平)的书。或许我们正用 "minimum should match" (最少应该匹配)的方式在文档中对标题或页面内容进行搜索,或许我们正在搜索所有名字为 John Smith 的用户。 +有时候或许我们正查找作者 Leo Tolstoy 写的一本名为 _War and Peace_(战争与和平)的书。或许我们正用 “minimum should match” (最少应该匹配)的方式在文档中对标题或页面内容进行搜索,或许我们正在搜索所有名字为 John Smith 的用户。 在本章,我们会介绍构造多语句搜索的工具及在特定场景下应该采用的解决方案。 From fab2c225b378d87e786cdea6a2acc6d0021acca8 Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Mon, 18 Apr 2016 16:39:31 +0800 Subject: [PATCH 37/95] chapter17_part1: /130_Partial_Matching/05_Intro.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 初译 --- 170_Relevance/05_Intro.asciidoc | 30 +++++++----------------------- 1 file changed, 7 insertions(+), 23 deletions(-) diff --git a/170_Relevance/05_Intro.asciidoc b/170_Relevance/05_Intro.asciidoc index 6b13bc923..9c6d59bfb 100644 --- a/170_Relevance/05_Intro.asciidoc +++ b/170_Relevance/05_Intro.asciidoc @@ -1,30 +1,14 @@ [[controlling-relevance]] -== Controlling Relevance +== 控制相关度 -Databases that deal purely in structured data (such as dates, numbers, and -string enums) have it easy: they((("relevance", "controlling"))) just have to check whether a document (or a -row, in a relational database) matches the query. +处理结构化数据(比如:时间、数字、字符串、枚举)的数据库,((("relevance", "controlling")))只需检查文档(或关系数据库里的行)是否与查询匹配。 -While Boolean yes/no matches are an essential part of full-text search, they -are not enough by themselves. Instead, we also need to know how relevant each -document is to the query. Full-text search engines have to not only find the -matching documents, but also sort them by relevance. +布尔的是/非匹配是全文搜索的基础,但不止如此,我们还要知道每个文档与查询的相关度,在全文搜索引擎中不仅需要找到匹配的文档,还需根据它们相关度的高低进行排序。 -Full-text relevance ((("similarity algorithms")))formulae, or _similarity algorithms_, combine several -factors to produce a single relevance `_score` for each document. In this -chapter, we examine the various moving parts and discuss how they can be -controlled. +全文相关的公式或 _相似算法(similarity algorithms)_ ((("similarity algorithms")))会将多个因素合并起来,为每个文档生成一个相关度评分 `_score` 。本章中,我们会验证各种可变部分,然后讨论如何来控制它们。 -Of course, relevance is not just about full-text queries; it may need to -take structured data into account as well. Perhaps we are looking for a -vacation home with particular features (air-conditioning, sea view, free -WiFi). The more features that a property has, the more relevant it is. Or -perhaps we want to factor in sliding scales like recency, price, popularity, or -distance, while still taking the relevance of a full-text query into account. +当然,相关度不只与全文查询有关,也需要将结构化的数据考虑其中。可能我们正在找一个度假屋,需要一些的详细特征(空调、海景、免费WiFi),匹配的特征越多相关度越高。可能我们还希望有一些其他的考虑因素,如回头率、价格、受欢迎度或距离,当然也同时考虑全文查询的相关度。 -All of this is possible thanks to the powerful scoring infrastructure -available in Elasticsearch. 
+所有的这些都可以通过 Elasticsearch 强大的评分基础来实现。 -We will start by looking at the theoretical side of how Lucene calculates -relevance, and then move on to practical examples of how you can control the -process. +本章会先从理论上介绍 Lucene 是如何计算相关度的,然后通过实际例子说明如何控制相关度的计算过程的。 From 45dcc5d0e42e82f1ed2919544300e52aa35d08c7 Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Fri, 22 Apr 2016 16:07:46 +0800 Subject: [PATCH 38/95] chapter17_part13: /170_Relevance/65_Script_score.asciidoc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 初译 --- 170_Relevance/65_Script_score.asciidoc | 76 ++++++++------------------ 1 file changed, 24 insertions(+), 52 deletions(-) diff --git a/170_Relevance/65_Script_score.asciidoc b/170_Relevance/65_Script_score.asciidoc index c6ca3babf..6b202ec1f 100644 --- a/170_Relevance/65_Script_score.asciidoc +++ b/170_Relevance/65_Script_score.asciidoc @@ -1,22 +1,15 @@ [[script-score]] -=== Scoring with Scripts +=== 脚本评分 -Finally, if none of the `function_score`'s built-in functions suffice, you can -implement the logic that you need with a script, using the `script_score` -function.((("function_score query", "using script_score function")))((("script_score function")))((("relevance", "controlling", "scoring with scripts"))) +最后,如果所有 `function_score` 内置的函数都无法满足应用场景,可以使用 `script_score` 函数自行实现逻辑。((("function_score query", "using script_score function")))((("script_score function")))((("relevance", "controlling", "scoring with scripts"))) -For an example, let's say that we want to factor our profit margin into the -relevance calculation. In our business, the profit margin depends on three -factors: +举个例子,想将利润空间作为因子加入到相关度评分计算,在业务中,利润空间和以下三点相关: -* The `price` per night of the vacation home. -* The user's membership level--some levels get a percentage `discount` - above a certain price per night `threshold`. -* The negotiated `margin` as a percentage of the price-per-night, after user - discounts. +* `price` 度假屋每晚的价格。 +* 会员用户的级别——某些等级的用户可以在每晚房价高于某个 `threshold` 阀值价格的时候享受折扣 `discount` 。 +* 用户享受折扣后,经过议价的每晚房价的利润 `margin` 。 -The algorithm that we will use to calculate the profit for each home is as -follows: +计算每个度假屋利润的算法如下: [source,groovy] ------------------------- @@ -27,11 +20,8 @@ if (price < threshold) { } ------------------------- -We probably don't want to use the absolute profit as a score; it would -overwhelm the other factors like location, popularity and features. Instead, -we can express the profit as a percentage of our `target` profit. 
A profit -margin above our target will have a positive score (greater than `1.0`), and a profit margin below our target will have a negative score (less than -`1.0`): +我们很可能不想用绝对利润作为评分,这会弱化其他如地点、受欢迎度和特性等因子的作用,而是将利润用目标利润 `target` 的百分比来表示,高于 +目标的利润空间会有一个正向评分(大于 `1.0` ),低于目标的利润空间会有一个负向分数(小于 `1.0` ): [source,groovy] ------------------------- @@ -43,9 +33,7 @@ if (price < threshold) { return profit / target ------------------------- -The default scripting language in Elasticsearch is -http://groovy.codehaus.org/[Groovy], which for the most part looks a lot like -JavaScript.((("Groovy", "script factoring profit margins into relevance calculations"))) The preceding algorithm as a Groovy script would look like this: +Elasticsearch 里使用 http://groovy.codehaus.org/[Groovy] 作为默认的脚本语言,它与JavaScript很像,((("Groovy", "script factoring profit margins into relevance calculations")))上面这个算法用 Groovy 脚本表示如下: [source,groovy] ------------------------- @@ -57,13 +45,10 @@ if (price < threshold) { <2> } return price * (1 - discount) * margin / target <2> ------------------------- -<1> The `price` and `margin` variables are extracted from the `price` and - `margin` fields in the document. -<2> The `threshold`, `discount`, and `target` variables we will pass in as - `params`. +<1> `price` 和 `margin` 变量可以分别从文档的 `price` 和 `margin` 字段提取。 +<2> `threshold` 、 `discount` 和 `target` 是作为参数 `params` 传入的。 -Finally, we can add our `script_score` function to the list of other functions -that we are already using: +最终我们将 `script_score` 函数与其他函数一起使用: [source,json] ------------------------- @@ -80,7 +65,7 @@ GET /_search "discount": 0.1, "target": 10 }, - "script": "price = doc['price'].value; margin = doc['margin'].value; + "script": "price = doc['price'].value; margin = doc['margin'].value; if (price < threshold) { return price * margin / target }; return price * (1 - discount) * margin / target;" <3> } @@ -89,35 +74,22 @@ GET /_search } } ------------------------- -<1> The `location` and `price` clauses refer to the example explained in - <>. -<2> By passing in these variables as `params`, we can change their values - every time we run this query without having to recompile the script. -<3> JSON cannot include embedded newline characters. Newline characters in - the script should either be escaped as `\n` or replaced with semicolons. +<1> `location` 和 `price` 语句在 <> 中解释过。 +<2> 将这些变量作为参数 `params` 传递,我们可以查询时动态改变脚本无须重新编译。 +<3> JSON 不能接受内嵌的换行符,脚本中的换行符可以用 `\n` 或 `;` 符号替代。 -This query would return the documents that best satisfy the user's -requirements for location and price, while still factoring in our need to make -a profit. +这个查询根据用户对地点和价格的需求,返回用户最满意的文档,同时也考虑到我们对于盈利的要求。 [TIP] ======================================== -The `script_score` function provides enormous flexibility.((("scripts", "performance and"))) Within a script, -you have access to the fields of the document, to the current `_score`, and -even to the term frequencies, inverse document frequencies, and field length -norms (see {ref}/modules-advanced-scripting.html[Text scoring in scripts]). +`script_score` 函数提供了巨大的灵活性,((("scripts", "performance and")))可以通过脚本访问文档里的所有字段、当前评分 `_score` 甚至词频、逆向文档频率和字段长度规范值这样的信息(参见 see {ref}/modules-advanced-scripting.html[脚本对文本评分])。 -That said, scripts can have a performance impact. If you do find that your -scripts are not quite fast enough, you have three options: +有人说使用脚本对性能会有影响,如果确实发现脚本执行较慢,可以有以下三种选择: -* Try to precalculate as much information as possible and include it in each - document. 
-* Groovy is fast, but not quite as fast as Java.((("Java", "scripting in"))) You could reimplement your - script as a native Java script. (See - {ref}/modules-scripting.html#native-java-scripts[Native Java Scripts]). -* Use the `rescore` functionality((("rescoring"))) described in <> to apply - your script to only the best-scoring documents. +* 尽可能多的提前计算各种信息并将结果存入每个文档中。 +* Groovy 很快,但没 Java 快。((("Java", "scripting in")))可以将脚本用原生的 Java 脚本重新实现。(参见 + {ref}/modules-scripting.html#native-java-scripts[原生 Java 脚本])。 +* 仅对那些最佳评分的文档应用脚本,使用 <> 中提到的 `rescore` 功能。((("rescoring"))) ======================================== - From b51b446179f28c2972ced917127bfd01c4f75e68 Mon Sep 17 00:00:00 2001 From: Takuya Date: Wed, 18 May 2016 01:00:11 +0800 Subject: [PATCH 39/95] chapter4_part1 040_Distributed_CRUD/00_Intro.asciidoc --- 040_Distributed_CRUD/00_Intro.asciidoc | 27 ++++++++++++-------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/040_Distributed_CRUD/00_Intro.asciidoc b/040_Distributed_CRUD/00_Intro.asciidoc index b849ce9db..23111fb6b 100644 --- a/040_Distributed_CRUD/00_Intro.asciidoc +++ b/040_Distributed_CRUD/00_Intro.asciidoc @@ -1,25 +1,22 @@ [[distributed-docs]] -== Distributed Document Store +== 分布式文档存储 -In the preceding chapter, we looked at all the ways to put data into your index and -then retrieve it. But we glossed over many technical details surrounding how -the data is distributed and fetched from the cluster. This separation is done -on purpose; you don't really need to know how data is distributed to work -with Elasticsearch. It just works. +在前面的章节,我们介绍了如何索引和查询数据,不过我们忽略了很多底层的技术细节, +例如文件是如何分布到集群的,又是如何从集群中获取的。 +Elasticsearch 本意就是隐藏这些底层细节,让我们好专注在业务开发中,所以其实你不必了解这么深入也无妨。 -In this chapter, we dive into those internal, technical details -to help you understand how your data is stored in a distributed system. +在这个章节中,我们将深入探索这些核心的技术细节,这能帮助你更好地理解数据如何被存储到这个分布式系统中。 -.Content Warning + +.警告 **** -The information presented in this chapter is for your interest. You are not required to -understand and remember all the detail in order to use Elasticsearch. The -options that are discussed are for advanced users only. +这个章节包含了一些高级话题,上面也提到过,就算你不记住和理解所有的细节仍然能正常使用 Elasticsearch。 +如果你有兴趣的话,这个章节可以作为你的课外兴趣读物,扩展你的知识面。 -Read the section to gain a taste for how things work, and to know where the -information is in case you need to refer to it in the future, but don't be -overwhelmed by the detail. +如果你在阅读这个章节的时候感到很吃力,也不用担心。 +这个章节仅仅只是用来告诉你 Elasticsearch 是如何工作的, +将来在工作中如果你需要用到这个章节提供的知识,可以再回过头来翻阅。 **** From 07db8c8f60341e5c32844686d6233b19e7c62b8b Mon Sep 17 00:00:00 2001 From: Takuya Date: Wed, 18 May 2016 01:01:37 +0800 Subject: [PATCH 40/95] chapter4_part1 040_Distributed_CRUD/00_Intro.asciidoc --- 040_Distributed_CRUD/00_Intro.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/040_Distributed_CRUD/00_Intro.asciidoc b/040_Distributed_CRUD/00_Intro.asciidoc index 23111fb6b..320298d48 100644 --- a/040_Distributed_CRUD/00_Intro.asciidoc +++ b/040_Distributed_CRUD/00_Intro.asciidoc @@ -8,7 +8,7 @@ Elasticsearch 本意就是隐藏这些底层细节,让我们好专注在业务 在这个章节中,我们将深入探索这些核心的技术细节,这能帮助你更好地理解数据如何被存储到这个分布式系统中。 -.警告 +.注意 **** 这个章节包含了一些高级话题,上面也提到过,就算你不记住和理解所有的细节仍然能正常使用 Elasticsearch。 From 31461236dd36c896af771a093f6fda00fb9499c3 Mon Sep 17 00:00:00 2001 From: richardwei2008 Date: Fri, 20 May 2016 14:01:40 +0800 Subject: [PATCH 41/95] =?UTF-8?q?=E4=BF=AE=E6=94=B9?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 1. WiFi 处空格 2. 
如何控制相关度的计算过程的 》 去掉‘的’ --- 170_Relevance/05_Intro.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/170_Relevance/05_Intro.asciidoc b/170_Relevance/05_Intro.asciidoc index 9c6d59bfb..f304f2147 100644 --- a/170_Relevance/05_Intro.asciidoc +++ b/170_Relevance/05_Intro.asciidoc @@ -7,8 +7,8 @@ 全文相关的公式或 _相似算法(similarity algorithms)_ ((("similarity algorithms")))会将多个因素合并起来,为每个文档生成一个相关度评分 `_score` 。本章中,我们会验证各种可变部分,然后讨论如何来控制它们。 -当然,相关度不只与全文查询有关,也需要将结构化的数据考虑其中。可能我们正在找一个度假屋,需要一些的详细特征(空调、海景、免费WiFi),匹配的特征越多相关度越高。可能我们还希望有一些其他的考虑因素,如回头率、价格、受欢迎度或距离,当然也同时考虑全文查询的相关度。 +当然,相关度不只与全文查询有关,也需要将结构化的数据考虑其中。可能我们正在找一个度假屋,需要一些的详细特征(空调、海景、免费 WiFi ),匹配的特征越多相关度越高。可能我们还希望有一些其他的考虑因素,如回头率、价格、受欢迎度或距离,当然也同时考虑全文查询的相关度。 所有的这些都可以通过 Elasticsearch 强大的评分基础来实现。 -本章会先从理论上介绍 Lucene 是如何计算相关度的,然后通过实际例子说明如何控制相关度的计算过程的。 +本章会先从理论上介绍 Lucene 是如何计算相关度的,然后通过实际例子说明如何控制相关度的计算过程。 From d08e689e193976327ab05258a14f2f50ff463a8b Mon Sep 17 00:00:00 2001 From: Medcl Date: Tue, 31 May 2016 07:50:14 +0200 Subject: [PATCH 42/95] =?UTF-8?q?Revert=20"020=5FDistributed=5FCluster/00?= =?UTF-8?q?=5FIntro.asciidoc=20=E7=BF=BB=E8=AF=91"?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- 020_Distributed_Cluster/00_Intro.asciidoc | 39 +++++++++++++++-------- 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/020_Distributed_Cluster/00_Intro.asciidoc b/020_Distributed_Cluster/00_Intro.asciidoc index cf6760716..cc998ab38 100644 --- a/020_Distributed_Cluster/00_Intro.asciidoc +++ b/020_Distributed_Cluster/00_Intro.asciidoc @@ -1,25 +1,36 @@ [[distributed-cluster]] -== 集群内的生活 +== Life Inside a Cluster -.补充章节 +.Supplemental Chapter **** -正如前面提到的,这是第一个关于Elasticsearch在分布式((("集群")))环境中是如何运作的几个补充章节。 -在本章中,我们将介绍常用的术语,如 _集群_,_节点_ 和 _碎片_ ,Elasticsearch如何横向扩展的机制,以及它如何处理硬件故障。 +As mentioned earlier, this is the first of several supplemental chapters +about how Elasticsearch operates in a distributed((("clusters"))) environment. In this +chapter, we explain commonly used terminology like _cluster_, _node_, and +_shard_, the mechanics of how Elasticsearch scales out, and how it deals with +hardware failure. -尽管本章不是必读的-您可以使用Elasticsearch很长一段时间,而不用担心碎片,复制和故障转移-它将帮助您了解Elasticsearch内部工作的流程。 -浏览本章稍后再次参阅。 +Although this chapter is not required reading--you can use Elasticsearch for +a long time without worrying about shards, replication, and failover--it will +help you to understand the processes at work inside Elasticsearch. Feel free +to skim through the chapter and to refer to it again later. **** -Elasticsearch构建为((("scalability, Elasticsearch and")))始终可用并可根据您的需求扩展。 -扩展可以购买更大的((("vertical scaling, Elasticsearch and")))服务器 (_垂直扩展_, 或 _纵向扩展_) -或者购买更多的((("horizontal scaling, Elasticsearch and")))服务器 (_水平扩展_, 或 _横向扩展_). +Elasticsearch is built to be ((("scalability, Elasticsearch and")))always available, and to scale with your needs. +Scale can come from buying bigger ((("vertical scaling, Elasticsearch and")))servers (_vertical scale_, or _scaling up_) +or from buying more ((("horizontal scaling, Elasticsearch and")))servers (_horizontal scale_, or _scaling out_). -Elasticsearch可以受益于更强大的硬件,但垂直扩展有其局限性。 -真正的可伸缩性来自水平横向扩容-能够添加更多的节点到集群来分散它们之间负载和可用性的能力。 +While Elasticsearch can benefit from more-powerful hardware, vertical scale +has its limits. Real scalability comes from horizontal scale--the ability to +add more nodes to the cluster and to spread load and reliability between them. 
-对于大多数数据库,水平扩展通常需要您的应用程序进行额外的修改才能充分利用这些额外的数据库。 -相比之下,Elasticsearch 原生支持 _分布式_ :它知道如何管理多个节点来提供可伸缩和高可用性。这意味着你的应用程序不需要去关心它。 +With most databases, scaling horizontally usually requires a major overhaul of +your application to take advantage of these extra boxes. In contrast, +Elasticsearch is _distributed_ by nature: it knows how to manage multiple +nodes to provide scale and high availability. This also means that your +application doesn't need to care about it. -在本章中,我们向您展示如何设置集群,节点和分片来按需扩容并保证硬件故障中的数据安全。 +In this chapter, we show how you can set up your cluster, +nodes, and shards to scale with your needs and to ensure that your data is +safe from hardware failure. From 89e02f93b1c68f86d78dc149c531ebdd682badd3 Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 15:48:34 +0800 Subject: [PATCH 43/95] Update 85_Sorting.asciidoc --- 056_Sorting/85_Sorting.asciidoc | 96 +++++++++++++++------------------ 1 file changed, 42 insertions(+), 54 deletions(-) diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index 28d1e7977..9f682cafe 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -1,21 +1,20 @@ [[sorting]] -== Sorting and Relevance +== 排序与相关性 -By default, results are returned sorted by _relevance_—with the most -relevant docs first.((("sorting", "by relevance")))((("relevance", "sorting results by"))) Later in this chapter, we explain what we mean by -_relevance_ and how it is calculated, but let's start by looking at the `sort` -parameter and how to use it. -=== Sorting +默认的是,返回的结果是按照_相关性_进行排序的—相关性最强的文档在最前。((("sorting", "by relevance")))((("relevance", "sorting results by"))) 在本章的稍后,我们会解释_相关性_意味着什么和它是如何计算的,让我们开始的时候着眼于`sort`参数和如何使用它吧。 -In order to sort by relevance, we need to represent relevance as a value. In -Elasticsearch, the _relevance score_ is represented by the floating-point -number returned in the search results as the `_score`, ((("relevance scores", "returned in search results score")))((("score", "relevance score of search results")))so the default sort -order is `_score` descending. -Sometimes, though, you don't have a meaningful relevance score. For instance, -the following query just returns all tweets whose `user_id` field has the -value `1`: + +=== 排序 + + + +为了按照相关性来排序,需要将相关性表示为一个值。在elasticsearch中, _relevance score_ 是作为一个浮点数,并有结果中的 `_score`返回,((("relevance scores", "returned in search results score")))((("score", "relevance score of search results")))因此默认排序是`_score`降序的。 + + +有些时候,尽管你并没有一个有意义的相关性洗漱。例如,下面的查询返回所有 `user_id` 字段包含1的结果 + [source,js] -------------------------------------------------- @@ -33,14 +32,13 @@ GET /_search } -------------------------------------------------- -Filters have no bearing on `_score`, and the((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and"))) missing-but-implied `match_all` -query just sets the `_score` to a neutral value of `1` for all documents. In -other words, all documents are considered to be equally relevant. 
+筛选不与`_score`相关,并且((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and")))默认的隐式的`match_all`查询仅将所有文档的`_score`设置为中性的`1`。即为,所有的文档被认定是同等相关性的。 + -==== Sorting by Field Values +==== 按照字段的值排序 -In this case, it probably makes sense to sort tweets by recency, with the most -recent tweets first.((("sorting", "by field values")))((("fields", "sorting search results by field values")))((("sort parameter"))) We can do this with the `sort` parameter: + +在这个案例中,通过最近修改来排序是有意义的,最新的排在最前。((("sorting", "by field values")))((("fields", "sorting search results by field values")))((("sort parameter")))我们可以使用`sort`参数 [source,js] -------------------------------------------------- @@ -77,39 +75,33 @@ You will notice two differences in the results: ... } -------------------------------------------------- -<1> The `_score` is not calculated, because it is not being used for sorting. -<2> The value of the `date` field, expressed as milliseconds since the epoch, - is returned in the `sort` values. - -The first is that we have ((("date field, sorting search results by")))a new element in each result called `sort`, which -contains the value(s) that was used for sorting. In this case, we sorted on -`date`, which internally is((("milliseconds-since-the-epoch (date)"))) indexed as _milliseconds since the epoch_. The long -number `1411516800000` is equivalent to the date string `2014-09-24 00:00:00 -UTC`. - -The second is that the `_score` and `max_score` are both `null`. ((("score", "not calculating"))) Calculating -the `_score` can be quite expensive, and usually its only purpose is for -sorting; we're not sorting by relevance, so it doesn't make sense to keep -track of the `_score`. If you want the `_score` to be calculated regardless, -you can set((("track_scores parameter"))) the `track_scores` parameter to `true`. +<1> `_score` 不是被计算的, 因为它并没有用于排序。 +<2> `date` 字段的值将转化为unix时间戳毫秒数,然后返回`sort`字段的值 + + +第一点是我们在每个结果中有((("date field, sorting search results by")))一个新的名为`sort`的元素,它包含了我们用于排序的值。在这个案例中,我们按照`date`进行排序(这由unix时间戳毫秒数得到)。长数 `1411516800000` 等驾驭时间戳字符串`2014-09-24 00:00:00 +UTC`。 + + +第二点是`_score` 和 `max_score`字段都是`null`。((("score", "not calculating")))计算`_score`的花销巨大,通常仅用于排序;我们并不根据相关性排序,所以保留`_score`的记录是没有意义的。如果无论如何你都要计算`_score`,你可以将((("track_scores parameter"))) `track_scores` 参数设置为 `true`. + [TIP] ==== -As a shortcut, you can ((("sorting", "specifying just the field name to sort on")))specify just the name of the field to sort on: +一个简便方法是, 你可以 ((("sorting", "specifying just the field name to sort on")))指定定一个字段用来排序 [source,js] -------------------------------------------------- "sort": "number_of_children" -------------------------------------------------- -Fields will be sorted in ((("sorting", "default ordering")))ascending order by default, and -the `_score` value in descending order. +字段将会默认升序排序 ((("sorting", "default ordering"))), 而 `_score`的值 将会降序 ==== ==== Multilevel Sorting -Perhaps we want to combine the `_score` from a((("sorting", "multilevel")))((("multilevel sorting"))) query with the `date`, and -show all matching results sorted first by date, then by relevance: + +也许我们想要结合使用`date`和`_score`进行查询,并且匹配的结果首先按照日期排序,然后按照相关性排序 [source,js] -------------------------------------------------- @@ -129,18 +121,17 @@ GET /_search -------------------------------------------------- // SENSE: 056_Sorting/85_Multilevel_sort.json -Order is important. Results are sorted by the first criterion first. 
Only -results whose first `sort` value is identical will then be sorted by the -second criterion, and so on. -Multilevel sorting doesn't have to involve the `_score`. You could sort -by using several different fields,((("fields", "sorting by multiple fields"))) on geo-distance or on a custom value -calculated in a script. +顺序是重要的。结果首先被第一个规则排序,仅当同时满足第一个规则时才会按照第二个规则进行排序,其余类似。 + + +多重排序和`_score`并无不相关。你可以根据一些不同的字段进行排序,((("fields", "sorting by multiple fields"))),如地理距离或是脚本计算的特定值。 [NOTE] ==== -Query-string search((("sorting", "in query string searches")))((("sort parameter", "using in query strings")))((("query strings", "sorting search results for"))) also supports custom sorting, using the `sort` parameter -in the query string: + +字符串查询((("sorting", "in query string searches")))((("sort parameter", "using in query strings")))((("query strings", "sorting search results for")))也支持特定排序,可以在查询字符串中使用`sort`参数 + [source,js] -------------------------------------------------- @@ -148,15 +139,12 @@ GET /_search?sort=date:desc&sort=_score&q=search -------------------------------------------------- ==== -==== Sorting on Multivalue Fields +==== 字段多值的排序 + +一种情形是字段有多个值的排序,((("sorting", "on multivalue fields")))((("fields", "multivalue", "sorting on"))) 需要记住这些值并没有固有的顺序;一个多值的字段仅仅是多个值的包装,这时应道选择那个进行排序呢? -When sorting on fields with more than one value,((("sorting", "on multivalue fields")))((("fields", "multivalue", "sorting on"))) remember that the values do -not have any intrinsic order; a multivalue field is just a bag of values. -Which one do you choose to sort on? +对于数字或事日期,你可以将多值字段减为单值,这可以通过使用 `min`, `max`, `avg`, 或是 `sum` _sort modes_。 ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))例如你可以按照每个`date`字段中的最早日期进行排序,如下: -For numbers and dates, you can reduce a multivalue field to a single value -by using the `min`, `max`, `avg`, or `sum` _sort modes_. 
((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))For instance, you -could sort on the earliest date in each `dates` field by using the following: [role="pagebreak-before"] [source,js] From dfe7b919e1d7e1fc826090ebd910bef07b41facb Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 15:49:23 +0800 Subject: [PATCH 44/95] Update 85_Sorting.asciidoc --- 056_Sorting/85_Sorting.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index 9f682cafe..b9e219ff4 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -2,7 +2,7 @@ == 排序与相关性 -默认的是,返回的结果是按照_相关性_进行排序的—相关性最强的文档在最前。((("sorting", "by relevance")))((("relevance", "sorting results by"))) 在本章的稍后,我们会解释_相关性_意味着什么和它是如何计算的,让我们开始的时候着眼于`sort`参数和如何使用它吧。 +默认的是,返回的结果是按照 _相关性_ 进行排序的—相关性最强的文档在最前。((("sorting", "by relevance")))((("relevance", "sorting results by"))) 在本章的稍后,我们会解释_相关性_意味着什么和它是如何计算的,让我们开始的时候着眼于`sort`参数和如何使用它吧。 @@ -32,7 +32,7 @@ GET /_search } -------------------------------------------------- -筛选不与`_score`相关,并且((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and")))默认的隐式的`match_all`查询仅将所有文档的`_score`设置为中性的`1`。即为,所有的文档被认定是同等相关性的。 +筛选不与`_score`相关,并且((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and")))默认的隐式的`match_all`查询仅将所有文档的 `_score` 设置为中性的`1`。即为,所有的文档被认定是同等相关性的。 ==== 按照字段的值排序 From 3751505efe9b8eb36731d87d0c2bbe64d2d3a8d2 Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 15:52:06 +0800 Subject: [PATCH 45/95] Update 85_Sorting.asciidoc --- 056_Sorting/85_Sorting.asciidoc | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index b9e219ff4..6e3875514 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -2,7 +2,7 @@ == 排序与相关性 -默认的是,返回的结果是按照 _相关性_ 进行排序的—相关性最强的文档在最前。((("sorting", "by relevance")))((("relevance", "sorting results by"))) 在本章的稍后,我们会解释_相关性_意味着什么和它是如何计算的,让我们开始的时候着眼于`sort`参数和如何使用它吧。 +默认的是,返回的结果是按照 _相关性_ 进行排序的—相关性最强的文档在最前。((("sorting", "by relevance")))((("relevance", "sorting results by"))) 在本章的稍后,我们会解释 _相关性_ 意味着什么和它是如何计算的,让我们开始的时候着眼于 `sort` 参数和如何使用它吧。 @@ -10,7 +10,7 @@ -为了按照相关性来排序,需要将相关性表示为一个值。在elasticsearch中, _relevance score_ 是作为一个浮点数,并有结果中的 `_score`返回,((("relevance scores", "returned in search results score")))((("score", "relevance score of search results")))因此默认排序是`_score`降序的。 +为了按照相关性来排序,需要将相关性表示为一个值。在elasticsearch中, _relevance score_ 是作为一个浮点数,并有结果中的 `_score` 返回,((("relevance scores", "returned in search results score")))((("score", "relevance score of search results")))因此默认排序是 `_score` 降序的。 有些时候,尽管你并没有一个有意义的相关性洗漱。例如,下面的查询返回所有 `user_id` 字段包含1的结果 @@ -32,13 +32,13 @@ GET /_search } -------------------------------------------------- -筛选不与`_score`相关,并且((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and")))默认的隐式的`match_all`查询仅将所有文档的 `_score` 设置为中性的`1`。即为,所有的文档被认定是同等相关性的。 +筛选不与 `_score` 相关,并且((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and")))默认的隐式的 `match_all` 查询仅将所有文档的 `_score` 设置为中性的 `1` 。即为,所有的文档被认定是同等相关性的。 ==== 按照字段的值排序 -在这个案例中,通过最近修改来排序是有意义的,最新的排在最前。((("sorting", "by field 
values")))((("fields", "sorting search results by field values")))((("sort parameter")))我们可以使用`sort`参数 +在这个案例中,通过最近修改来排序是有意义的,最新的排在最前。((("sorting", "by field values")))((("fields", "sorting search results by field values")))((("sort parameter")))我们可以使用 `sort` 参数 [source,js] -------------------------------------------------- @@ -79,11 +79,11 @@ You will notice two differences in the results: <2> `date` 字段的值将转化为unix时间戳毫秒数,然后返回`sort`字段的值 -第一点是我们在每个结果中有((("date field, sorting search results by")))一个新的名为`sort`的元素,它包含了我们用于排序的值。在这个案例中,我们按照`date`进行排序(这由unix时间戳毫秒数得到)。长数 `1411516800000` 等驾驭时间戳字符串`2014-09-24 00:00:00 +第一点是我们在每个结果中有((("date field, sorting search results by")))一个新的名为 `sort` 的元素,它包含了我们用于排序的值。在这个案例中,我们按照 `date` 进行排序(这由unix时间戳毫秒数得到)。长数 `1411516800000` 等驾驭时间戳字符串 `2014-09-24 00:00:00 UTC`。 -第二点是`_score` 和 `max_score`字段都是`null`。((("score", "not calculating")))计算`_score`的花销巨大,通常仅用于排序;我们并不根据相关性排序,所以保留`_score`的记录是没有意义的。如果无论如何你都要计算`_score`,你可以将((("track_scores parameter"))) `track_scores` 参数设置为 `true`. +第二点是 `_score` 和 `max_score` 字段都是 `null` 。((("score", "not calculating")))计算 `_score` 的花销巨大,通常仅用于排序;我们并不根据相关性排序,所以保留 `_score` 的记录是没有意义的。如果无论如何你都要计算 `_score` ,你可以将((("track_scores parameter"))) `track_scores` 参数设置为 `true`. [TIP] @@ -95,13 +95,13 @@ UTC`。 "sort": "number_of_children" -------------------------------------------------- -字段将会默认升序排序 ((("sorting", "default ordering"))), 而 `_score`的值 将会降序 +字段将会默认升序排序 ((("sorting", "default ordering"))), 而 `_score` 的值 将会降序 ==== ==== Multilevel Sorting -也许我们想要结合使用`date`和`_score`进行查询,并且匹配的结果首先按照日期排序,然后按照相关性排序 +也许我们想要结合使用 `date` 和 `_score` 进行查询,并且匹配的结果首先按照日期排序,然后按照相关性排序 [source,js] -------------------------------------------------- @@ -125,12 +125,12 @@ GET /_search 顺序是重要的。结果首先被第一个规则排序,仅当同时满足第一个规则时才会按照第二个规则进行排序,其余类似。 -多重排序和`_score`并无不相关。你可以根据一些不同的字段进行排序,((("fields", "sorting by multiple fields"))),如地理距离或是脚本计算的特定值。 +多重排序和 `_score` 并无不相关。你可以根据一些不同的字段进行排序,((("fields", "sorting by multiple fields"))),如地理距离或是脚本计算的特定值。 [NOTE] ==== -字符串查询((("sorting", "in query string searches")))((("sort parameter", "using in query strings")))((("query strings", "sorting search results for")))也支持特定排序,可以在查询字符串中使用`sort`参数 +字符串查询((("sorting", "in query string searches")))((("sort parameter", "using in query strings")))((("query strings", "sorting search results for")))也支持特定排序,可以在查询字符串中使用 `sort` 参数 [source,js] @@ -143,7 +143,7 @@ GET /_search?sort=date:desc&sort=_score&q=search 一种情形是字段有多个值的排序,((("sorting", "on multivalue fields")))((("fields", "multivalue", "sorting on"))) 需要记住这些值并没有固有的顺序;一个多值的字段仅仅是多个值的包装,这时应道选择那个进行排序呢? 
-对于数字或事日期,你可以将多值字段减为单值,这可以通过使用 `min`, `max`, `avg`, 或是 `sum` _sort modes_。 ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))例如你可以按照每个`date`字段中的最早日期进行排序,如下: +对于数字或事日期,你可以将多值字段减为单值,这可以通过使用 `min`, `max` , `avg` , 或是 `sum` _sort modes_ 。 ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))例如你可以按照每个`date`字段中的最早日期进行排序,如下: [role="pagebreak-before"] From 1dd977a17c4b15d0f045968b7a3a9ffbf1515051 Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 15:52:54 +0800 Subject: [PATCH 46/95] Update 85_Sorting.asciidoc --- 056_Sorting/85_Sorting.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index 6e3875514..377caa249 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -143,7 +143,7 @@ GET /_search?sort=date:desc&sort=_score&q=search 一种情形是字段有多个值的排序,((("sorting", "on multivalue fields")))((("fields", "multivalue", "sorting on"))) 需要记住这些值并没有固有的顺序;一个多值的字段仅仅是多个值的包装,这时应道选择那个进行排序呢? -对于数字或事日期,你可以将多值字段减为单值,这可以通过使用 `min`, `max` , `avg` , 或是 `sum` _sort modes_ 。 ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))例如你可以按照每个`date`字段中的最早日期进行排序,如下: +对于数字或事日期,你可以将多值字段减为单值,这可以通过使用 `min`, `max` , `avg` , 或是 `sum` _sort modes_ 。 ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))例如你可以按照每个 `date` 字段中的最早日期进行排序,如下: [role="pagebreak-before"] From 819cf2315d6d1d4ae99ea882e85e91b2fe65ed1c Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 16:17:12 +0800 Subject: [PATCH 47/95] Update 85_Sorting.asciidoc --- 056_Sorting/85_Sorting.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index 377caa249..ca4878252 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -10,10 +10,10 @@ -为了按照相关性来排序,需要将相关性表示为一个值。在elasticsearch中, _relevance score_ 是作为一个浮点数,并有结果中的 `_score` 返回,((("relevance scores", "returned in search results score")))((("score", "relevance score of search results")))因此默认排序是 `_score` 降序的。 +为了按照相关性来排序,需要将相关性表示为一个值。在elasticsearch中, _relevance score_ 是作为一个浮点数,并在结果中的 `_score` 返回,((("relevance scores", "returned in search results score")))((("score", "relevance score of search results")))因此默认排序是 `_score` 降序的。 -有些时候,尽管你并没有一个有意义的相关性洗漱。例如,下面的查询返回所有 `user_id` 字段包含1的结果 +有些时候,尽管你并没有一个有意义的相关性系数。例如,下面的查询返回所有 `user_id` 字段包含 `1` 的结果 [source,js] From 46ff94f02c6e06ed15fd5fc27b69a2b7f9b46d09 Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 16:19:31 +0800 Subject: [PATCH 48/95] Update 85_Sorting.asciidoc --- 056_Sorting/85_Sorting.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index ca4878252..3bd4a6e3a 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -54,7 +54,7 @@ GET /_search -------------------------------------------------- // SENSE: 056_Sorting/85_Sort_by_date.json -You will notice two differences in the results: +你将注意结果中的两个不同点: [source,js] -------------------------------------------------- @@ -79,11 +79,11 @@ You will notice two differences in 
the results: <2> `date` 字段的值将转化为unix时间戳毫秒数,然后返回`sort`字段的值 -第一点是我们在每个结果中有((("date field, sorting search results by")))一个新的名为 `sort` 的元素,它包含了我们用于排序的值。在这个案例中,我们按照 `date` 进行排序(这由unix时间戳毫秒数得到)。长数 `1411516800000` 等驾驭时间戳字符串 `2014-09-24 00:00:00 +第一点是我们在每个结果中有((("date field, sorting search results by")))一个新的名为 `sort` 的元素,它包含了我们用于排序的值。在这个案例中,我们按照 `date` 进行排序(这由unix时间戳毫秒数得到)。长数 `1411516800000` 等价于时间戳字符串 `2014-09-24 00:00:00 UTC`。 -第二点是 `_score` 和 `max_score` 字段都是 `null` 。((("score", "not calculating")))计算 `_score` 的花销巨大,通常仅用于排序;我们并不根据相关性排序,所以保留 `_score` 的记录是没有意义的。如果无论如何你都要计算 `_score` ,你可以将((("track_scores parameter"))) `track_scores` 参数设置为 `true`. +第二点是 `_score` 和 `max_score` 字段都是 `null` 。((("score", "not calculating")))计算 `_score` 的花销巨大,通常仅用于排序;我们并不根据相关性排序,所以保留 `_score` 的痕迹是没有意义的。如果无论如何你都要计算 `_score` ,你可以将((("track_scores parameter"))) `track_scores` 参数设置为 `true`. [TIP] @@ -95,7 +95,7 @@ UTC`。 "sort": "number_of_children" -------------------------------------------------- -字段将会默认升序排序 ((("sorting", "default ordering"))), 而 `_score` 的值 将会降序 +字段将会默认升序排序 ((("sorting", "default ordering"))), 而 `_score` 的值将会降序 ==== ==== Multilevel Sorting From c57793cfd45ad12ee3fe5cae71828a4c6ec07ae5 Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 16:35:44 +0800 Subject: [PATCH 49/95] Update 88_String_sorting.asciidoc --- 056_Sorting/88_String_sorting.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index 8f57c4ad8..9d7288876 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -1,5 +1,5 @@ [[multi-fields]] -=== String Sorting and Multifields +=== 字符串排序与多字段 Analyzed string fields are also multivalue fields,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) but sorting on them seldom gives you the results you want. If you analyze a string like `fine old art`, From b811bc1e25ea3af6066d57e33c2329cd737db4fd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E6=A8=8A=E7=83=A8=E5=B0=94?= Date: Wed, 20 Jul 2016 16:56:43 +0800 Subject: [PATCH 50/95] daily update --- 056_Sorting/88_String_sorting.asciidoc | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index 9d7288876..4b5b5f5a5 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -1,4 +1,4 @@ -[[multi-fields]] +[[多字段]] === 字符串排序与多字段 Analyzed string fields are also multivalue fields,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) but sorting on them seldom @@ -7,13 +7,22 @@ it results in three terms. We probably want to sort alphabetically on the first term, then the second term, and so forth, but Elasticsearch doesn't have this information at its disposal at sort time. -You could use the `min` and `max` sort modes (it uses `min` by default), but -that will result in sorting on either `art` or `old`, neither of which was the -intent. -In order to sort on a string field, that field should contain one term only: -the whole `not_analyzed` string.((("not_analyzed string fields", "sorting on"))) But of course we still need the field to be -`analyzed` in order to be able to query it as full text. 
+ + + + + + + +你可以使用 `min` 和 `max` 排序模式(默认是 `min` ),但是这会导致排序以 `art` 或是 `old` ,任何一个都不是所希望的 + + + +为了以字符串字段进行排序, 这个字段应仅包含一项: +整个 `not_analyzed` 字符串。((("not_analyzed string fields", "sorting on"))) 但是我们仍需要 `analyzed` 字段,这样才能以全文进行查询 + + The naive approach to indexing the same string in two ways would be to include two separate fields in the document: one that is `analyzed` for searching, @@ -49,7 +58,7 @@ into a _multifield_ mapping like this: -------------------------------------------------- // SENSE: 056_Sorting/88_Multifield.json -<1> The main `tweet` field is just the same as before: an `analyzed` full-text +<1> `tweet` 主字段 is just the same as before: an `analyzed` full-text field. <2> The new `tweet.raw` subfield is `not_analyzed`. From 51049d4c61e94908c01d80ae636476ce7b15ef0c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E6=A8=8A=E7=83=A8=E5=B0=94?= Date: Wed, 20 Jul 2016 17:36:54 +0800 Subject: [PATCH 51/95] daily refresh --- 056_Sorting/88_String_sorting.asciidoc | 39 ++++++++++++-------------- 1 file changed, 18 insertions(+), 21 deletions(-) diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index 4b5b5f5a5..eabe21b00 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -1,18 +1,9 @@ [[多字段]] === 字符串排序与多字段 -Analyzed string fields are also multivalue fields,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) but sorting on them seldom -gives you the results you want. If you analyze a string like `fine old art`, -it results in three terms. We probably want to sort alphabetically on the -first term, then the second term, and so forth, but Elasticsearch doesn't have this -information at its disposal at sort time. - - - - - - +被解析的字符串字段也是多值字段,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) 但是很少会按照你想要的方式进行排序。如果你想分析一个字符串,如 `fine old art` , +这包含3项。我们很坑想要按第一项的字母排序,然后按第二项的字母排序,诸如此类,但是Elasticsearch在排序过程中没有这样的信息。 你可以使用 `min` 和 `max` 排序模式(默认是 `min` ),但是这会导致排序以 `art` 或是 `old` ,任何一个都不是所希望的 @@ -24,15 +15,21 @@ information at its disposal at sort time. -The naive approach to indexing the same string in two ways would be to include -two separate fields in the document: one that is `analyzed` for searching, -and one that is `not_analyzed` for sorting. +一个简单的方法是两种方式对同一个字符串进行索引,这将在文档中包括两个字段 : `analyzed` 用于搜索, `not_analyzed` 用于排序 + But storing the same string twice in the `_source` field is waste of space. What we really want to do is to pass in a _single field_ but to _index it in two different ways_. 
All of the _core_ field types (strings, numbers, Booleans, dates) accept a `fields` parameter ((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping")))that allows you to transform a simple mapping like +但是保存相同的字符串两次在 `_source` 字段是浪费空间的。 +我们真正想要做的是传递一个 _单字段_ 但是 却用两种方式索引它。所有的 _core_field 类型 (strings, numbers, Booleans, dates) 接收一个 `字段s` 参数((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping"))) + +该参数允许你转化一个简单的映射如 + + + [source,js] -------------------------------------------------- "tweet": { @@ -41,7 +38,7 @@ simple mapping like } -------------------------------------------------- -into a _multifield_ mapping like this: +为一个多字段映射如: [source,js] -------------------------------------------------- @@ -58,12 +55,12 @@ into a _multifield_ mapping like this: -------------------------------------------------- // SENSE: 056_Sorting/88_Multifield.json -<1> `tweet` 主字段 is just the same as before: an `analyzed` full-text - field. -<2> The new `tweet.raw` subfield is `not_analyzed`. +<1> `tweet` 主字段与之前的一样: 是一个 `analyzed` 全文字段。 +<2> 新的 `tweet.raw` 子字段是 `not_analyzed`. + + +现在, 至少我们已经重新索引了我们的数据,使用 `tweet` 字段用于搜索,`tweet.raw` 字段用于排序: -Now, or at least as soon as we have reindexed our data, we can use the `tweet` -field for search and the `tweet.raw` field for sorting: [source,js] -------------------------------------------------- @@ -79,6 +76,6 @@ GET /_search -------------------------------------------------- // SENSE: 056_Sorting/88_Multifield.json -WARNING: Sorting on a full-text `analyzed` field can use a lot of memory. See +WARNING: 以全文 `analyzed` 字段排序会消耗大量的内存. See <> for more information. From 75aff0a9e000a239803a599088ddf86878cb9bb6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E6=A8=8A=E7=83=A8=E5=B0=94?= Date: Wed, 20 Jul 2016 17:40:06 +0800 Subject: [PATCH 52/95] fixed unfit edition --- 056_Sorting/88_String_sorting.asciidoc | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index eabe21b00..6d322b68e 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -15,13 +15,9 @@ -一个简单的方法是两种方式对同一个字符串进行索引,这将在文档中包括两个字段 : `analyzed` 用于搜索, `not_analyzed` 用于排序 +一个简单的方法是用两种方式对同一个字符串进行索引,这将在文档中包括两个字段 : `analyzed` 用于搜索, `not_analyzed` 用于排序 -But storing the same string twice in the `_source` field is waste of space. -What we really want to do is to pass in a _single field_ but to _index it in two different ways_. 
All of the _core_ field types (strings, numbers, -Booleans, dates) accept a `fields` parameter ((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping")))that allows you to transform a -simple mapping like 但是保存相同的字符串两次在 `_source` 字段是浪费空间的。 我们真正想要做的是传递一个 _单字段_ 但是 却用两种方式索引它。所有的 _core_field 类型 (strings, numbers, Booleans, dates) 接收一个 `字段s` 参数((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping"))) From 61b24a0c6795cf76dcca158cb127e9aebee4d353 Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 19:12:04 +0800 Subject: [PATCH 53/95] no message --- 056_Sorting/88_String_sorting.asciidoc | 77 -------------------------- 1 file changed, 77 deletions(-) delete mode 100644 056_Sorting/88_String_sorting.asciidoc diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc deleted file mode 100644 index 6d322b68e..000000000 --- a/056_Sorting/88_String_sorting.asciidoc +++ /dev/null @@ -1,77 +0,0 @@ -[[多字段]] -=== 字符串排序与多字段 - - -被解析的字符串字段也是多值字段,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) 但是很少会按照你想要的方式进行排序。如果你想分析一个字符串,如 `fine old art` , -这包含3项。我们很坑想要按第一项的字母排序,然后按第二项的字母排序,诸如此类,但是Elasticsearch在排序过程中没有这样的信息。 - - -你可以使用 `min` 和 `max` 排序模式(默认是 `min` ),但是这会导致排序以 `art` 或是 `old` ,任何一个都不是所希望的 - - - -为了以字符串字段进行排序, 这个字段应仅包含一项: -整个 `not_analyzed` 字符串。((("not_analyzed string fields", "sorting on"))) 但是我们仍需要 `analyzed` 字段,这样才能以全文进行查询 - - - -一个简单的方法是用两种方式对同一个字符串进行索引,这将在文档中包括两个字段 : `analyzed` 用于搜索, `not_analyzed` 用于排序 - - - -但是保存相同的字符串两次在 `_source` 字段是浪费空间的。 -我们真正想要做的是传递一个 _单字段_ 但是 却用两种方式索引它。所有的 _core_field 类型 (strings, numbers, Booleans, dates) 接收一个 `字段s` 参数((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping"))) - -该参数允许你转化一个简单的映射如 - - - -[source,js] --------------------------------------------------- -"tweet": { - "type": "string", - "analyzer": "english" -} --------------------------------------------------- - -为一个多字段映射如: - -[source,js] --------------------------------------------------- -"tweet": { <1> - "type": "string", - "analyzer": "english", - "fields": { - "raw": { <2> - "type": "string", - "index": "not_analyzed" - } - } -} --------------------------------------------------- -// SENSE: 056_Sorting/88_Multifield.json - -<1> `tweet` 主字段与之前的一样: 是一个 `analyzed` 全文字段。 -<2> 新的 `tweet.raw` 子字段是 `not_analyzed`. - - -现在, 至少我们已经重新索引了我们的数据,使用 `tweet` 字段用于搜索,`tweet.raw` 字段用于排序: - - -[source,js] --------------------------------------------------- -GET /_search -{ - "query": { - "match": { - "tweet": "elasticsearch" - } - }, - "sort": "tweet.raw" -} --------------------------------------------------- -// SENSE: 056_Sorting/88_Multifield.json - -WARNING: 以全文 `analyzed` 字段排序会消耗大量的内存. See -<> for more information. 
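// Sorting on an `analyzed` field loads its fielddata into memory, which is the
// cost the warning above refers to. A minimal sketch for keeping an eye on that
// cost (assuming the standard indices-stats API is available) is to request
// per-field fielddata statistics:

[source,js]
--------------------------------------------------
GET /_stats/fielddata?fields=tweet,tweet.raw
--------------------------------------------------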
- From 1e50bd09b9a56a1799b7d6d7378e907a83b30182 Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 19:12:30 +0800 Subject: [PATCH 54/95] chapter8_part2 --- 056_Sorting/88_String_sorting.asciidoc | 77 ++++++++++++++++++++++++++ 1 file changed, 77 insertions(+) create mode 100644 056_Sorting/88_String_sorting.asciidoc diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc new file mode 100644 index 000000000..6d322b68e --- /dev/null +++ b/056_Sorting/88_String_sorting.asciidoc @@ -0,0 +1,77 @@ +[[多字段]] +=== 字符串排序与多字段 + + +被解析的字符串字段也是多值字段,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) 但是很少会按照你想要的方式进行排序。如果你想分析一个字符串,如 `fine old art` , +这包含3项。我们很坑想要按第一项的字母排序,然后按第二项的字母排序,诸如此类,但是Elasticsearch在排序过程中没有这样的信息。 + + +你可以使用 `min` 和 `max` 排序模式(默认是 `min` ),但是这会导致排序以 `art` 或是 `old` ,任何一个都不是所希望的 + + + +为了以字符串字段进行排序, 这个字段应仅包含一项: +整个 `not_analyzed` 字符串。((("not_analyzed string fields", "sorting on"))) 但是我们仍需要 `analyzed` 字段,这样才能以全文进行查询 + + + +一个简单的方法是用两种方式对同一个字符串进行索引,这将在文档中包括两个字段 : `analyzed` 用于搜索, `not_analyzed` 用于排序 + + + +但是保存相同的字符串两次在 `_source` 字段是浪费空间的。 +我们真正想要做的是传递一个 _单字段_ 但是 却用两种方式索引它。所有的 _core_field 类型 (strings, numbers, Booleans, dates) 接收一个 `字段s` 参数((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping"))) + +该参数允许你转化一个简单的映射如 + + + +[source,js] +-------------------------------------------------- +"tweet": { + "type": "string", + "analyzer": "english" +} +-------------------------------------------------- + +为一个多字段映射如: + +[source,js] +-------------------------------------------------- +"tweet": { <1> + "type": "string", + "analyzer": "english", + "fields": { + "raw": { <2> + "type": "string", + "index": "not_analyzed" + } + } +} +-------------------------------------------------- +// SENSE: 056_Sorting/88_Multifield.json + +<1> `tweet` 主字段与之前的一样: 是一个 `analyzed` 全文字段。 +<2> 新的 `tweet.raw` 子字段是 `not_analyzed`. + + +现在, 至少我们已经重新索引了我们的数据,使用 `tweet` 字段用于搜索,`tweet.raw` 字段用于排序: + + +[source,js] +-------------------------------------------------- +GET /_search +{ + "query": { + "match": { + "tweet": "elasticsearch" + } + }, + "sort": "tweet.raw" +} +-------------------------------------------------- +// SENSE: 056_Sorting/88_Multifield.json + +WARNING: 以全文 `analyzed` 字段排序会消耗大量的内存. See +<> for more information. + From 138a24e03bfa11542e096f91250eabb2a8c251d0 Mon Sep 17 00:00:00 2001 From: fanyer Date: Wed, 20 Jul 2016 19:28:02 +0800 Subject: [PATCH 55/95] / 056_Sorting/ 88_String_sorting.asciidoc --- 056_Sorting/85_Sorting.asciidoc | 98 ++++++++++++++++++--------------- 1 file changed, 55 insertions(+), 43 deletions(-) diff --git a/056_Sorting/85_Sorting.asciidoc b/056_Sorting/85_Sorting.asciidoc index 3bd4a6e3a..28d1e7977 100644 --- a/056_Sorting/85_Sorting.asciidoc +++ b/056_Sorting/85_Sorting.asciidoc @@ -1,20 +1,21 @@ [[sorting]] -== 排序与相关性 +== Sorting and Relevance +By default, results are returned sorted by _relevance_—with the most +relevant docs first.((("sorting", "by relevance")))((("relevance", "sorting results by"))) Later in this chapter, we explain what we mean by +_relevance_ and how it is calculated, but let's start by looking at the `sort` +parameter and how to use it. 
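In its simplest form, a `sort` clause just names a field and a direction. A
minimal sketch, using the same tweet data as the rest of this chapter (a
fuller, filtered version of this request appears just below):

[source,js]
--------------------------------------------------
GET /_search
{
    "query": { "match_all": {} },
    "sort":  { "date": { "order": "desc" }}
}
--------------------------------------------------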
-默认的是,返回的结果是按照 _相关性_ 进行排序的—相关性最强的文档在最前。((("sorting", "by relevance")))((("relevance", "sorting results by"))) 在本章的稍后,我们会解释 _相关性_ 意味着什么和它是如何计算的,让我们开始的时候着眼于 `sort` 参数和如何使用它吧。 +=== Sorting +In order to sort by relevance, we need to represent relevance as a value. In +Elasticsearch, the _relevance score_ is represented by the floating-point +number returned in the search results as the `_score`, ((("relevance scores", "returned in search results score")))((("score", "relevance score of search results")))so the default sort +order is `_score` descending. - -=== 排序 - - - -为了按照相关性来排序,需要将相关性表示为一个值。在elasticsearch中, _relevance score_ 是作为一个浮点数,并在结果中的 `_score` 返回,((("relevance scores", "returned in search results score")))((("score", "relevance score of search results")))因此默认排序是 `_score` 降序的。 - - -有些时候,尽管你并没有一个有意义的相关性系数。例如,下面的查询返回所有 `user_id` 字段包含 `1` 的结果 - +Sometimes, though, you don't have a meaningful relevance score. For instance, +the following query just returns all tweets whose `user_id` field has the +value `1`: [source,js] -------------------------------------------------- @@ -32,13 +33,14 @@ GET /_search } -------------------------------------------------- -筛选不与 `_score` 相关,并且((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and")))默认的隐式的 `match_all` 查询仅将所有文档的 `_score` 设置为中性的 `1` 。即为,所有的文档被认定是同等相关性的。 - +Filters have no bearing on `_score`, and the((("score", seealso="relevance; relevance scores")))((("match_all query", "score as neutral 1")))((("filters", "score and"))) missing-but-implied `match_all` +query just sets the `_score` to a neutral value of `1` for all documents. In +other words, all documents are considered to be equally relevant. -==== 按照字段的值排序 +==== Sorting by Field Values - -在这个案例中,通过最近修改来排序是有意义的,最新的排在最前。((("sorting", "by field values")))((("fields", "sorting search results by field values")))((("sort parameter")))我们可以使用 `sort` 参数 +In this case, it probably makes sense to sort tweets by recency, with the most +recent tweets first.((("sorting", "by field values")))((("fields", "sorting search results by field values")))((("sort parameter"))) We can do this with the `sort` parameter: [source,js] -------------------------------------------------- @@ -54,7 +56,7 @@ GET /_search -------------------------------------------------- // SENSE: 056_Sorting/85_Sort_by_date.json -你将注意结果中的两个不同点: +You will notice two differences in the results: [source,js] -------------------------------------------------- @@ -75,33 +77,39 @@ GET /_search ... } -------------------------------------------------- -<1> `_score` 不是被计算的, 因为它并没有用于排序。 -<2> `date` 字段的值将转化为unix时间戳毫秒数,然后返回`sort`字段的值 - - -第一点是我们在每个结果中有((("date field, sorting search results by")))一个新的名为 `sort` 的元素,它包含了我们用于排序的值。在这个案例中,我们按照 `date` 进行排序(这由unix时间戳毫秒数得到)。长数 `1411516800000` 等价于时间戳字符串 `2014-09-24 00:00:00 -UTC`。 - - -第二点是 `_score` 和 `max_score` 字段都是 `null` 。((("score", "not calculating")))计算 `_score` 的花销巨大,通常仅用于排序;我们并不根据相关性排序,所以保留 `_score` 的痕迹是没有意义的。如果无论如何你都要计算 `_score` ,你可以将((("track_scores parameter"))) `track_scores` 参数设置为 `true`. - +<1> The `_score` is not calculated, because it is not being used for sorting. +<2> The value of the `date` field, expressed as milliseconds since the epoch, + is returned in the `sort` values. + +The first is that we have ((("date field, sorting search results by")))a new element in each result called `sort`, which +contains the value(s) that was used for sorting. 
In this case, we sorted on +`date`, which internally is((("milliseconds-since-the-epoch (date)"))) indexed as _milliseconds since the epoch_. The long +number `1411516800000` is equivalent to the date string `2014-09-24 00:00:00 +UTC`. + +The second is that the `_score` and `max_score` are both `null`. ((("score", "not calculating"))) Calculating +the `_score` can be quite expensive, and usually its only purpose is for +sorting; we're not sorting by relevance, so it doesn't make sense to keep +track of the `_score`. If you want the `_score` to be calculated regardless, +you can set((("track_scores parameter"))) the `track_scores` parameter to `true`. [TIP] ==== -一个简便方法是, 你可以 ((("sorting", "specifying just the field name to sort on")))指定定一个字段用来排序 +As a shortcut, you can ((("sorting", "specifying just the field name to sort on")))specify just the name of the field to sort on: [source,js] -------------------------------------------------- "sort": "number_of_children" -------------------------------------------------- -字段将会默认升序排序 ((("sorting", "default ordering"))), 而 `_score` 的值将会降序 +Fields will be sorted in ((("sorting", "default ordering")))ascending order by default, and +the `_score` value in descending order. ==== ==== Multilevel Sorting - -也许我们想要结合使用 `date` 和 `_score` 进行查询,并且匹配的结果首先按照日期排序,然后按照相关性排序 +Perhaps we want to combine the `_score` from a((("sorting", "multilevel")))((("multilevel sorting"))) query with the `date`, and +show all matching results sorted first by date, then by relevance: [source,js] -------------------------------------------------- @@ -121,17 +129,18 @@ GET /_search -------------------------------------------------- // SENSE: 056_Sorting/85_Multilevel_sort.json +Order is important. Results are sorted by the first criterion first. Only +results whose first `sort` value is identical will then be sorted by the +second criterion, and so on. -顺序是重要的。结果首先被第一个规则排序,仅当同时满足第一个规则时才会按照第二个规则进行排序,其余类似。 - - -多重排序和 `_score` 并无不相关。你可以根据一些不同的字段进行排序,((("fields", "sorting by multiple fields"))),如地理距离或是脚本计算的特定值。 +Multilevel sorting doesn't have to involve the `_score`. You could sort +by using several different fields,((("fields", "sorting by multiple fields"))) on geo-distance or on a custom value +calculated in a script. [NOTE] ==== - -字符串查询((("sorting", "in query string searches")))((("sort parameter", "using in query strings")))((("query strings", "sorting search results for")))也支持特定排序,可以在查询字符串中使用 `sort` 参数 - +Query-string search((("sorting", "in query string searches")))((("sort parameter", "using in query strings")))((("query strings", "sorting search results for"))) also supports custom sorting, using the `sort` parameter +in the query string: [source,js] -------------------------------------------------- @@ -139,12 +148,15 @@ GET /_search?sort=date:desc&sort=_score&q=search -------------------------------------------------- ==== -==== 字段多值的排序 - -一种情形是字段有多个值的排序,((("sorting", "on multivalue fields")))((("fields", "multivalue", "sorting on"))) 需要记住这些值并没有固有的顺序;一个多值的字段仅仅是多个值的包装,这时应道选择那个进行排序呢? 
+==== Sorting on Multivalue Fields -对于数字或事日期,你可以将多值字段减为单值,这可以通过使用 `min`, `max` , `avg` , 或是 `sum` _sort modes_ 。 ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))例如你可以按照每个 `date` 字段中的最早日期进行排序,如下: +When sorting on fields with more than one value,((("sorting", "on multivalue fields")))((("fields", "multivalue", "sorting on"))) remember that the values do +not have any intrinsic order; a multivalue field is just a bag of values. +Which one do you choose to sort on? +For numbers and dates, you can reduce a multivalue field to a single value +by using the `min`, `max`, `avg`, or `sum` _sort modes_. ((("sum sort mode")))((("avg sort mode")))((("max sort mode")))((("min sort mode")))((("sort modes")))((("dates field, sorting on earliest value")))For instance, you +could sort on the earliest date in each `dates` field by using the following: [role="pagebreak-before"] [source,js] From 1b179c5e0ab5111659049296eb2105782bf563b5 Mon Sep 17 00:00:00 2001 From: fanyer Date: Thu, 21 Jul 2016 11:28:47 +0800 Subject: [PATCH 56/95] =?UTF-8?q?refresh=20=E5=AD=97=E6=AE=B5=E6=95=B0?= =?UTF-8?q?=E6=8D=AE?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- 056_Sorting/95_Fielddata.asciidoc | 71 +++++++++++++++---------------- 1 file changed, 34 insertions(+), 37 deletions(-) diff --git a/056_Sorting/95_Fielddata.asciidoc b/056_Sorting/95_Fielddata.asciidoc index 10ba6a947..5e7fd8322 100644 --- a/056_Sorting/95_Fielddata.asciidoc +++ b/056_Sorting/95_Fielddata.asciidoc @@ -1,54 +1,51 @@ -[[fielddata-intro]] -=== Fielddata +[[字段数据介绍]] +=== 字段数据 -Our final topic in this chapter is about an internal aspect of Elasticsearch. -While we don't demonstrate any new techniques here, fielddata is an -important topic that we will refer to repeatedly, and is something that you -should be aware of.((("fielddata"))) -When you sort on a field, Elasticsearch needs access to the value of that -field for every document that matches the query.((("inverted index", "sorting and"))) The inverted index, which -performs very well when searching, is not the ideal structure for sorting on -field values: -* When searching, we need to be able to map a term to a list of documents. +我们这章的终极目标是关于Elasticsearch的一个内部的方面,且我们在这里并不会阐述任何新的技术,字段数据使我们将会重复提到的一个重要话题,并且你应当明确它。((("fielddata"))) -* When sorting, we need to map a document to its terms. In other words, we - need to ``uninvert'' the inverted index. -To make sorting efficient, Elasticsearch loads all the values for -the field that you want to sort on into memory. This is referred to as -_fielddata_. -WARNING: Elasticsearch doesn't just load the values for the documents that matched a -particular query. It loads the values from _every document in your index_, -regardless of the document `type`. +当你以字段进行排序, Elasticsearch需要访问符合查询的每个文档的该字段的值。((("inverted index", "sorting and")))反转的索引(这会对搜索更加友好)在以字段值排序时不是理想的结构。 -The reason that Elasticsearch loads all values into memory is that uninverting the index -from disk is slow. Even though you may need the values for only a few docs -for the current request, you will probably need access to the values for other -docs on the next request, so it makes sense to load all the values into memory -at once, and to keep them there. 
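A minimal sketch of the sort-mode request referenced a few paragraphs above,
assuming the multivalued `dates` field it describes, would look something like
this:

[source,js]
--------------------------------------------------
GET /_search
{
    "sort": {
        "dates": {
            "order": "asc",
            "mode":  "min"
        }
    }
}
--------------------------------------------------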
-Fielddata is used in several places in Elasticsearch: +* 当搜索时,我们需要能将一个文档列表映射到某一项上。 -* Sorting on a field -* Aggregations on a field -* Certain filters (for example, geolocation filters) -* Scripts that refer to fields -Clearly, this can consume a lot of memory, especially for high-cardinality -string fields--string fields that have many unique values--like the body -of an email. Fortunately, insufficient memory is a problem that can be solved -by horizontal scaling, by adding more nodes to your cluster. -For now, all you need to know is what fielddata is, and to be aware that it -can be memory hungry. Later, we will show you how to determine the amount of memory that fielddata -is using, how to limit the amount of memory that is available to it, and -how to preload fielddata to improve the user experience. +* 当排序时, 我们需要映射一个文档到它的某项。 换句话说, 我们需要 ``反向反转`` 已经反转的索引。 +为了使得排序效率更高, Elasticsearch 会在内存中加载你想要以之排序的所有字段的值。 这便是提到的 _字段数据_ 。 + + + + +WARNING: Elasticsearch 并不仅仅加载匹配特定查询的文档的值。 他会加载 _你的数据库中的每个文档_ , 无论这个文档的 `type` + + + + +Elasticsearch在内存中加载所有的值的原因是在硬盘中逆反向索引是很慢的。虽然你当前的请求可能仅仅需要很少文档的值,你仍然可能在下次请求时需要可以访问其他文档的值,所以在内存中立即加载所有的值并驻留是有意义的。 + + + + +字段数据在Elasticsearch中被用于以下地方: + +* 按照字段排序 +* 按照字段聚合 +* 一些特定的筛选(例如,地理筛选) +* 引入字段的脚本 + + +显然的,这会消耗大量的内存,特别是对于高基数的字符串字段--字符串字段有很多独特的值--例如email的body体。幸运的是,内存效率低的问题可以通过增加集群的节点进行水平扩展来解决。 + +现在,所有你需要知道和明确的是它是极度需要内存的。稍后,我们会给你演示如何确定字段数据所占用的内存,如何限制可用的内存,和如何预加载字段数据来提高用户体验。 + + From d55308222f2d117c676eadb0938e2a11f375985b Mon Sep 17 00:00:00 2001 From: fanyer Date: Thu, 21 Jul 2016 11:31:52 +0800 Subject: [PATCH 57/95] error fixed --- 056_Sorting/95_Fielddata.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/056_Sorting/95_Fielddata.asciidoc b/056_Sorting/95_Fielddata.asciidoc index 5e7fd8322..1c3abdd22 100644 --- a/056_Sorting/95_Fielddata.asciidoc +++ b/056_Sorting/95_Fielddata.asciidoc @@ -3,7 +3,7 @@ -我们这章的终极目标是关于Elasticsearch的一个内部的方面,且我们在这里并不会阐述任何新的技术,字段数据使我们将会重复提到的一个重要话题,并且你应当明确它。((("fielddata"))) +我们这章的终极目标是关于Elasticsearch的一个内部的方面,且我们在这里并不会阐述任何新的技术,字段数据是我们将会重复提到的一个重要话题,并且你应当明确它。((("fielddata"))) @@ -24,7 +24,7 @@ -WARNING: Elasticsearch 并不仅仅加载匹配特定查询的文档的值。 他会加载 _你的数据库中的每个文档_ , 无论这个文档的 `type` +WARNING: Elasticsearch 并不仅仅加载匹配特定查询的文档的值。 他会加载 _你的数据库中的每个文档_ , 无论这个文档的 `type` From e29b6cb83076de4f80bb2e13dd13f607174a72d6 Mon Sep 17 00:00:00 2001 From: fanyer Date: Fri, 22 Jul 2016 10:55:09 +0800 Subject: [PATCH 58/95] =?UTF-8?q?=E7=9B=B8=E5=85=B3=E6=80=A7?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- 056_Sorting/90_What_is_relevance.asciidoc | 179 +++++++++++----------- 1 file changed, 87 insertions(+), 92 deletions(-) diff --git a/056_Sorting/90_What_is_relevance.asciidoc b/056_Sorting/90_What_is_relevance.asciidoc index d7f0ebd04..efe9abd6e 100644 --- a/056_Sorting/90_What_is_relevance.asciidoc +++ b/056_Sorting/90_What_is_relevance.asciidoc @@ -1,65 +1,59 @@ -[[relevance-intro]] -=== What Is Relevance? +[[相关性简介]] +=== 什么是相关性? -We've mentioned that, by default, results are returned in descending order of -relevance.((("relevance", "defined"))) But what is relevance? How is it calculated? -The relevance score of each document is represented by a positive floating-point number called the `_score`.((("score", "calculation of"))) The higher the `_score`, the more relevant -the document. -A query clause generates a `_score` for each document. 
How that score is -calculated depends on the type of query clause.((("fuzzy queries", "calculation of relevence score"))) Different query clauses are -used for different purposes: a `fuzzy` query might determine the `_score` by -calculating how similar the spelling of the found word is to the original -search term; a `terms` query would incorporate the percentage of terms that -were found. However, what we usually mean by _relevance_ is the algorithm that we -use to calculate how similar the contents of a full-text field are to a full-text query string. +我们曾经讲过,默认情况下,返回结果是按照相关性倒序排序的,((("relevance", "defined")))但是什么是相关性?相关性如何计算 +? -The standard _similarity algorithm_ used in Elasticsearch is((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("similarity algorithms", "Term Frequency/Inverse Document Frequency (TF/IDF)"))) known as _term -frequency/inverse document frequency_, or _TF/IDF_, which takes the following -factors into((("inverse document frequency"))) account: -Term frequency:: - How often does the term appear in the field? The more often, the more - relevant. A field containing five mentions of the same term is more likely - to be relevant than a field containing just one mention. +每个文档都会有相关性评分,用一个正浮点数 `_score` 来表示, `_scaore` 的评分越高,相关性越高。 -Inverse document frequency:: - How often does each term appear in the index? The more often, the _less_ - relevant. Terms that appear in many documents have a lower _weight_ than - more-uncommon terms. -Field-length norm:: +查询子句会为每个文档添加一个 `_score` 字段,评分的计算方式取决于不同的查询类型———不同的查询子句用于不同的查询目的。((("fuzzy queries", "calculation of relevence score"))) 一个 `fuzzy` +查询会计算与关键词的拼写相似程度, `terms` 查询会计算找到的内容于关键词组成部分匹配的百分比,但是一般意义上我们说的全文本搜索是指计算内容与关键词的类似程度。 - How long is the field? The longer it is, the less likely it is that words in - the field will be relevant. A term appearing in a short `title` field - carries more weight than the same term appearing in a long `content` field. -Individual ((("field-length norm")))queries may combine the TF/IDF score with other factors -such as the term proximity in phrase queries, or term similarity in -fuzzy queries. +Elasticsearch 的相似度算法被定义为 TF/IDF ,((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("similarity algorithms", "Term Frequency/Inverse Document Frequency (TF/IDF)")))即检索词频率/反向文档频率,包括((("inverse document frequency")))以下内容: -Relevance is not just about full-text search, though. It can equally be applied -to yes/no clauses, where the more clauses that match, the higher the -`_score`. -When multiple query clauses are combined using a compound query((("compound query clauses", "relevance score for results"))) like the -`bool` query, the `_score` from each of these query clauses is combined to -calculate the overall `_score` for the document. +检索词频率:: + + + 检索词在该字段出现的频率?出现频率越高,相关性越高。字段中出现5次相同的检索词要比只出现一次的相关性高。 + +反向文档频率:: + + 每个检索词在索引中出现的频率?出现的频率越高,相关性也越高。检索词出现在多数文档中的会比出现在少数文档中的权重更低,即检验一个检索词在文档中的普遍重要性。 + +字段长度准则:: + + + 字段的长度是多少?长度越长,相关性越低。检索词出现在一个短的 `title` 要比同样的词出现在一个长的 `content` 字段相关性更高。 + + +单个查询((("field-length norm")))可以使用 TF/IDF 评分标准或其他方式,比如在短语查询中检索词的距离或模糊查询中检索词的相似度。 + + + +虽然如此,相关性不仅仅关于全文搜索,也适用于 yes/no 子句, 匹配的字句越多,相关性评分越高。 + + + +当多条查询子句被合并为一条复合子句时,((("compound query clauses", "relevance score for results"))) 例如 `bool` 查询,则每个查询子句计算得出的得分会被合并到总的相关性评分中。 + + +TIP: 我们有了一整章关于相关性计算和如何使其按照你所希望的方式运作:<>. -TIP: We have a whole chapter dedicated to relevance calculations and how to -bend them to your will: <>. 
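Returning to the point about compound queries above: with a `bool` query, a
document that matches both of the `should` clauses in the sketch below ends up
with a higher `_score` than a document that matches only one of them (the
`tweet` field is the same one used throughout this chapter's examples):

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "tweet": "elasticsearch" }},
                { "match": { "tweet": "full text search" }}
            ]
        }
    }
}
--------------------------------------------------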
[[explain]] -==== Understanding the Score +==== 理解评分标准 + -When debugging a complex query,((("score", "calculation of")))((("relevance scores", "understanding"))) it can be difficult to understand -exactly how a `_score` has been calculated. Elasticsearch -has the option of producing an _explanation_ with every search result, -by setting the `explain` parameter((("explain parameter"))) to `true`. +当调试一个复杂的查询语句时, 想要理解相关性评分会比较困难。Elasticsearch在每个查询语句中都会生成 _explanation_ 选项,将 `explain` 参数设置为 `true` 就可以得到更详细的信息。 [source,js] -------------------------------------------------- @@ -69,18 +63,19 @@ GET /_search?explain <1> } -------------------------------------------------- // SENSE: 056_Sorting/90_Explain.json -<1> The `explain` parameter adds an explanation of how the `_score` was - calculated to every result. +<1> `explain` 参数 增加了对每个结果的 `_score` 评分是如何计算出来的。 [NOTE] ==== -Adding `explain` produces a lot((("explain parameter", "for relevance score calculation"))) of output for every hit, which can look -overwhelming, but it is worth taking the time to understand what it all means. -Don't worry if it doesn't all make sense now; you can refer to this section -when you need it. We'll work through the output for one `hit` bit by bit. + +增加一个 `explain` 参数会为每个匹配到的文档产生一大堆额外内容,但是花时间去理解它是有意义的。如果现在看不明白也没关系———等你需要的时候再来回顾这一节就行/夏眠我们来一点点地了解这块知识点。 + + ==== -First, we have the metadata that is returned on normal search requests: + +首先,我么看一下普通查询返回的元数据。 + [source,js] -------------------------------------------------- @@ -92,9 +87,10 @@ First, we have the metadata that is returned on normal search requests: "_source" : { ... trimmed ... }, -------------------------------------------------- -It adds information about the shard and the node that the document came from, -which is useful to know because term and document frequencies are calculated -per shard, rather than per index: + + +这里加入了文档来源的分片和节点的信息,这对我们是比较有帮助的,因为词频率和文档频率是在每个分片中计算出来的,而不是每个索引中。 + [source,js] -------------------------------------------------- @@ -102,10 +98,10 @@ per shard, rather than per index: "_node" : "mzIVYCsqSWCG_M_ZffSs9Q", -------------------------------------------------- -Then it provides the `_explanation`. Each ((("explanation of relevance score calculation")))((("description", "of relevance score calculations")))entry contains a `description` -that tells you what type of calculation is being performed, a `value` -that gives you the result of the calculation, and the `details` of any -subcalculations that were required: + + +然后返回值中的 `_explanation_` 会包含在每一个入口,((("explanation of relevance score calculation")))((("description", "of relevance score calculations")))告诉你采用了哪种计算方式,并让你知道计算结果和我们需要的其他详情。 + [source,js] -------------------------------------------------- @@ -141,55 +137,54 @@ subcalculations that were required: ] } -------------------------------------------------- -<1> Summary of the score calculation for `honeymoon` -<2> Term frequency -<3> Inverse document frequency -<4> Field-length norm +<1> `honeymoon` 相关性评分计算的总结 +<2> 检索词频率 +<3> 反向文档频率 +<4> 字段长度准则 + +WARNING: 输出 `explain` 的代价是昂贵的.((("explain parameter", "overhead of using"))) 它只能用作调试,而不要用于生产环境。 + + +第一部分是关于计算的总结。告诉了我们 文档 `0` 中`honeymoon` 在 `tweet` 字段中的检索词频率/反向文档频率 (TF/IDF)((("weight", "calculation of")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "weight calculation for a term")))。(这里的文档 `0` 是一个内部的ID,跟我们没有任何关系,可以忽略) -WARNING: Producing the `explain` output is expensive.((("explain parameter", "overhead of using"))) It is a debugging tool -only. 
Don't leave it turned on in production. -The first part is the summary of the calculation. It tells us that it has -calculated the _weight_—the ((("weight", "calculation of")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "weight calculation for a term")))TF/IDF--of the term `honeymoon` in the field `tweet`, for document `0`. (This is -an internal document ID and, for our purposes, can be ignored.) +然后给出了计算的权重计算出来的详情((("field-length norm")))((("inverse document frequency"))) 。 -It then provides details((("field-length norm")))((("inverse document frequency"))) of how the weight was calculated: -Term frequency:: +检索词频率:: - How many times did the term `honeymoon` appear in the `tweet` field in - this document? + 在本文档中检索词 `honeymoon` 在 `tweet` 字段中的出现次数。 -Inverse document frequency:: +反向文档频率:: - How many times did the term `honeymoon` appear in the `tweet` field - of all documents in the index? + 在本索引中, 本文档 `honeymoon` 在 `tweet` 字段出现次数和其他文档中出现总数的比率。 + + +字段长度准则:: + + 文档中 `tweet` 字段内容的长度——内容越长,其值越小 + + + +复杂的查询语句的解释也很复杂,但是包含的内容与上面例子大致相同。通过这段描述我们可以了解搜索结果的顺序是如何产生的,这些信息在我们调试时是无价的。 -Field-length norm:: - How long is the `tweet` field in this document? The longer the field, - the smaller this number. -Explanations for more-complicated queries can appear to be very complex, but -really they just contain more of the same calculations that appear in the -preceding example. This information can be invaluable for debugging why search -results appear in the order that they do. [TIP] ================================================================== -The output from `explain` can be difficult to read in JSON, but it is easier -when it is formatted as YAML.((("explain parameter", "formatting output in YAML")))((("YAML, formatting explain output in"))) Just add `format=yaml` to the query string. +json形式的 `explain` 会非常难以阅读, 但是转成yaml会好很多。((("explain parameter", "formatting output in YAML")))((("YAML, formatting explain output in"))) 仅仅需要在查询参数中增加 `format=yaml` 。 ================================================================== [[explain-api]] -==== Understanding Why a Document Matched +==== 理解文档是如何被匹配到的 + + +当 `explain` 选项加到某一文档上时,他会告诉你为何这个文档会被匹配,以及一个文档为何没有被匹配。((("relevance", "understanding why a document matched")))((("explain API, understanding why a document matched"))) -While the `explain` option adds an explanation for every result, you can use -the `explain` API to understand why one particular document matched or, more -important, why it _didn't_ match.((("relevance", "understanding why a document matched")))((("explain API, understanding why a document matched"))) -The path for the request is `/index/type/id/_explain`, as in the following: +请求路径为 `/index/type/id/_explain`, 如下所示: [source,js] -------------------------------------------------- @@ -205,14 +200,14 @@ GET /us/tweet/12/_explain -------------------------------------------------- // SENSE: 056_Sorting/90_Explain_API.json -Along with the full explanation((("description", "of why a document didn't match"))) that we saw previously, we also now have a -`description` element, which tells us this: +和我们之前看到的全部详情一起,我们现在有了一个 `element` 元素,并告知我们如下 [source,js] -------------------------------------------------- "failure to match filter: cache(user_id:[2 TO 2])" -------------------------------------------------- -In other words, our `user_id` filter clause is preventing the document from -matching. 
+ + +换句话说,我们的 `user_id` 过滤器子句防止了文档被匹配到 From 424f2fd330bb2d462e5b03a6f5836743c413ed9e Mon Sep 17 00:00:00 2001 From: luotitan Date: Sun, 24 Jul 2016 00:16:58 +0800 Subject: [PATCH 59/95] =?UTF-8?q?=E7=AC=AC=E4=B8=80=E6=AC=A1=E6=8F=90?= =?UTF-8?q?=E4=BA=A4?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../50_Scoring_fuzziness.asciidoc | 40 +++++++++---------- 1 file changed, 18 insertions(+), 22 deletions(-) diff --git a/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc b/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc index f45176495..56ed92df3 100644 --- a/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc +++ b/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc @@ -1,33 +1,29 @@ [[fuzzy-scoring]] -=== Scoring Fuzziness +=== 模糊性评分 -Users love fuzzy queries. They assume that these queries will somehow magically find -the right combination of proper spellings.((("fuzzy queries", "scoring fuzziness")))((("typoes and misspellings", "scoring fuzziness")))((("relevance scores", "fuzziness and"))) Unfortunately, the truth is -somewhat more prosaic. -Imagine that we have 1,000 documents containing ``Schwarzenegger,'' and just -one document with the misspelling ``Schwarzeneger.'' According to the theory -of <>, the misspelling is -much more relevant than the correct spelling, because it appears in far fewer -documents! +用户喜欢模糊查询。他们认为这种查询会魔法般的找到正确拼写组合。 +((("fuzzy queries", "scoring fuzziness")))((("typoes and misspellings", "scoring fuzziness")))((("relevance scores", "fuzziness and"))) +很遗憾,实际效果平平。 -In other words, if we were to treat fuzzy matches((("match query", "fuzzy match query"))) like any other match, we -would favor misspellings over correct spellings, which would make for grumpy -users. -TIP: Fuzzy matching should not be used for scoring purposes--only to widen -the net of matching terms in case there are misspellings. +假设我们有1000个文档包含 ``Schwarzenegger'' ,只是一个文档的出现拼写错误 ``Schwarzeneger'' 。 +根据 <> 理论,这个拼写错误文档比拼写正确的相关度更高,因为它更少在文档中出现! + + +换句话说,如果我们对待模糊匹配((("match query", "fuzzy match query")))类似其他匹配方法,我们将偏爱错误的拼写超过了正确的拼写,这会让用户发狂。 + + +TIP: 模糊匹配不应用于参与评分--只能在有拼写错误时扩大匹配项的范围。 + + +默认情况下, `match` 查询给定所有的模糊匹配的恒定评分为1。这可以满足在结果列表的末尾添加潜在的匹配记录,并且没有干扰非模糊查询的相关性评分。 -By default, the `match` query gives all fuzzy matches the constant score of 1. -This is sufficient to add potential matches onto the end of the result list, -without interfering with the relevance scoring of nonfuzzy queries. [TIP] ================================================== -Fuzzy queries alone are much less useful than they initially appear. They are -better used as part of a ``bigger'' feature, such as the _search-as-you-type_ -{ref}/search-suggesters-completion.html[`completion` suggester] or the -_did-you-mean_ {ref}/search-suggesters-phrase.html[`phrase` suggester]. 
- +在模糊查询最初出现时很少能单独使用。他们更好的作为一个 ``bigger'' 场景的部分功能特性,如 _search-as-you-type_ +{ref}/search-suggesters-completion.html[`完成` 建议]或 +_did-you-mean_ {ref}/search-suggesters-phrase.html[`短语` 建议]。 ================================================== From eb46ca8a7efbf0552ad786cb5b8b754bb9716e0c Mon Sep 17 00:00:00 2001 From: luotitan Date: Mon, 25 Jul 2016 00:44:19 +0800 Subject: [PATCH 60/95] =?UTF-8?q?=E7=AC=AC=E4=B8=80=E6=AC=A1=E6=8F=90?= =?UTF-8?q?=E4=BA=A4?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../60_Phonetic_matching.asciidoc | 92 ++++++++----------- 1 file changed, 40 insertions(+), 52 deletions(-) diff --git a/270_Fuzzy_matching/60_Phonetic_matching.asciidoc b/270_Fuzzy_matching/60_Phonetic_matching.asciidoc index 6e2fd59b6..67bcb51ae 100644 --- a/270_Fuzzy_matching/60_Phonetic_matching.asciidoc +++ b/270_Fuzzy_matching/60_Phonetic_matching.asciidoc @@ -1,35 +1,28 @@ [[phonetic-matching]] -=== Phonetic Matching - -In a last, desperate, attempt to match something, anything, we could resort to -searching for words that sound similar, ((("typoes and misspellings", "phonetic matching")))((("phonetic matching")))even if their spelling differs. - -Several algorithms exist for converting words into a phonetic -representation.((("phonetic algorithms"))) The http://en.wikipedia.org/wiki/Soundex[Soundex] algorithm is -the granddaddy of them all, and most other phonetic algorithms are -improvements or specializations of Soundex, such as -http://en.wikipedia.org/wiki/Metaphone[Metaphone] and -http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone[Double Metaphone] -(which expands phonetic matching to languages other than English), -http://en.wikipedia.org/wiki/Caverphone[Caverphone] for matching names in New -Zealand, the -https://en.wikipedia.org/wiki/Daitch–Mokotoff_Soundex#Beider.E2.80.93Morse_Phonetic_Name_Matching_Algorithm[Beider-Morse] algorithm, which adopts the Soundex algorithm -for better matching of German and Yiddish names, and the -http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik[Kölner Phonetik] for better -handling of German words. - -The thing to take away from this list is that phonetic algorithms are fairly -crude, and ((("languages", "phonetic algorithms")))very specific to the languages they were designed for, usually -either English or German. This limits their usefulness. Still, for certain -purposes, and in combination with other techniques, phonetic matching can be a -useful tool. - -First, you will need to install ((("Phonetic Analysis plugin")))the Phonetic Analysis plug-in from -https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html on every node -in the cluster, and restart each node. 
- -Then, you can create a custom analyzer that uses one of the -phonetic token filters ((("phonetic matching", "creating a phonetic analyzer")))and try it out: +=== 语音匹配 + +最后,在尝试任何其他匹配方法都无效后,我们可以求助于搜索发音相似的词,即使他们的拼写不同。 + + +存在一些将词转换成语音标识的算法。 +((("phonetic algorithms"))) http://en.wikipedia.org/wiki/Soundex[Soundex] 算法是这些算法的鼻祖, +而且大多数语音算法是 Soundex 的改进或者专业版本,例如 http://en.wikipedia.org/wiki/Metaphone[Metaphone] +和 http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone[Double Metaphone] (扩展了除英语以外的其他语言的语音匹配), +http://en.wikipedia.org/wiki/Caverphone[Caverphone] 算法匹配了新西兰的名称, +https://en.wikipedia.org/wiki/Daitch–Mokotoff_Soundex#Beider.E2.80.93Morse_Phonetic_Name_Matching_Algorithm[Beider-Morse] 算法吸收了 Soundex 算法为了更好的匹配德语和依地语名称, +http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik[Kölner Phonetik] 为了更好的处理德语词汇。 + + +值得一提的是,语音算法是相当简陋的,((("languages", "phonetic algorithms")))他们设计初衷针对的语言通常是英语或德语。这限制了他们的实用性。 +不过,为了某些明确的目标,并与其他技术相结合,语音匹配能够作为一个有用的工具。 + + +首先,你将需要从 +https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html 获取在集群的每个节点安装((("Phonetic Analysis plugin")))语言分析器插件, +并且重启每个节点。 + + +然后,您可以创建一个使用语音语汇单元过滤器的自定义分析器,并尝试下面的方法: [source,json] ----------------------------------- @@ -53,12 +46,11 @@ PUT /my_index } } ----------------------------------- -<1> First, configure a custom `phonetic` token filter that uses the - `double_metaphone` encoder. -<2> Then use the custom token filter in a custom analyzer. +<1> 首先,配置一个自定义 `phonetic` 语汇单元过滤器并使用 `double_metaphone` 编码器。 +<2> 然后在自定义分析器中使用自定义语汇单元过滤器。 -Now we can test it with the `analyze` API: +现在我们可以通过 `analyze` API 来进行测试: [source,json] ----------------------------------- @@ -66,13 +58,13 @@ GET /my_index/_analyze?analyzer=dbl_metaphone Smith Smythe ----------------------------------- -Each of `Smith` and `Smythe` produce two tokens in the same position: `SM0` -and `XMT`. Running `John`, `Jon`, and `Johnnie` through the analyzer will all -produce the two tokens `JN` and `AN`, while `Jonathon` results in the tokens -`JN0N` and `ANTN`. -The phonetic analyzer can be used just like any other analyzer. First map a -field to use it, and then index some data: +每个 `Smith` 和 `Smythe` 在同一位置产生两个语汇单元: `SM0` 和 `XMT` 。 +通过分析器播放 `John` , `Jon` 和 `Johnnie` 将产生两个语汇单元 `JN` 和 `AN` ,而 `Jonathon` 产生语汇单元 `JN0N` 和 `ANTN` 。 + + +语音分析器可以像任何其他分析器一样使用。 首先映射一个字段来使用它,然后索引一些数据: + [source,json] ----------------------------------- @@ -101,9 +93,10 @@ PUT /my_index/my_type/2 "name": "Jonnie Smythe" } ----------------------------------- -<1> The `name.phonetic` field uses the custom `dbl_metaphone` analyzer. +<1> `name.phonetic` 字段使用自定义 `dbl_metaphone` 分析器。 + -The `match` query can be used for searching: +可以使用 `match` 查询来进行搜索: [source,json] ----------------------------------- @@ -120,15 +113,10 @@ GET /my_index/my_type/_search } ----------------------------------- -This query returns both documents, demonstrating just how coarse phonetic -matching is. ((("phonetic matching", "purpose of"))) Scoring with a phonetic algorithm is pretty much worthless. The -purpose of phonetic matching is not to increase precision, but to increase -recall--to spread the net wide enough to catch any documents that might -possibly match.((("recall", "increasing with phonetic matching"))) - -It usually makes more sense to use phonetic algorithms when retrieving results -which will be consumed and post-processed by another computer, rather than by -human users. 
+这个查询返回全部两个文档,演示了如何进行简陋的语音匹配。 +((("phonetic matching", "purpose of"))) 用语音算法计算评分是没有价值的。 +语音匹配的目的不是为了提高精度,而是要提高召回率--以扩展足够的范围来捕获可能匹配的文档。 +通常是更有意义的使用语音算法是在检索到结果后,由另一台计算机进行消费和后续处理,而不是由人类用户直接使用。 From d818ea478c7aa91c3bd51e9c6babdd0f7888d64d Mon Sep 17 00:00:00 2001 From: Golden Looly Date: Tue, 26 Jul 2016 09:39:31 +0800 Subject: [PATCH 61/95] Revert "chapter8_part4: /056_Sorting/95_Fielddata.asciidoc" --- 056_Sorting/95_Fielddata.asciidoc | 71 ++++++++++++++++--------------- 1 file changed, 37 insertions(+), 34 deletions(-) diff --git a/056_Sorting/95_Fielddata.asciidoc b/056_Sorting/95_Fielddata.asciidoc index 1c3abdd22..10ba6a947 100644 --- a/056_Sorting/95_Fielddata.asciidoc +++ b/056_Sorting/95_Fielddata.asciidoc @@ -1,51 +1,54 @@ -[[字段数据介绍]] -=== 字段数据 +[[fielddata-intro]] +=== Fielddata +Our final topic in this chapter is about an internal aspect of Elasticsearch. +While we don't demonstrate any new techniques here, fielddata is an +important topic that we will refer to repeatedly, and is something that you +should be aware of.((("fielddata"))) +When you sort on a field, Elasticsearch needs access to the value of that +field for every document that matches the query.((("inverted index", "sorting and"))) The inverted index, which +performs very well when searching, is not the ideal structure for sorting on +field values: -我们这章的终极目标是关于Elasticsearch的一个内部的方面,且我们在这里并不会阐述任何新的技术,字段数据是我们将会重复提到的一个重要话题,并且你应当明确它。((("fielddata"))) +* When searching, we need to be able to map a term to a list of documents. +* When sorting, we need to map a document to its terms. In other words, we + need to ``uninvert'' the inverted index. +To make sorting efficient, Elasticsearch loads all the values for +the field that you want to sort on into memory. This is referred to as +_fielddata_. -当你以字段进行排序, Elasticsearch需要访问符合查询的每个文档的该字段的值。((("inverted index", "sorting and")))反转的索引(这会对搜索更加友好)在以字段值排序时不是理想的结构。 +WARNING: Elasticsearch doesn't just load the values for the documents that matched a +particular query. It loads the values from _every document in your index_, +regardless of the document `type`. +The reason that Elasticsearch loads all values into memory is that uninverting the index +from disk is slow. Even though you may need the values for only a few docs +for the current request, you will probably need access to the values for other +docs on the next request, so it makes sense to load all the values into memory +at once, and to keep them there. -* 当搜索时,我们需要能将一个文档列表映射到某一项上。 +Fielddata is used in several places in Elasticsearch: +* Sorting on a field +* Aggregations on a field +* Certain filters (for example, geolocation filters) +* Scripts that refer to fields +Clearly, this can consume a lot of memory, especially for high-cardinality +string fields--string fields that have many unique values--like the body +of an email. Fortunately, insufficient memory is a problem that can be solved +by horizontal scaling, by adding more nodes to your cluster. -* 当排序时, 我们需要映射一个文档到它的某项。 换句话说, 我们需要 ``反向反转`` 已经反转的索引。 +For now, all you need to know is what fielddata is, and to be aware that it +can be memory hungry. Later, we will show you how to determine the amount of memory that fielddata +is using, how to limit the amount of memory that is available to it, and +how to preload fielddata to improve the user experience. 
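As a quick preview of that later discussion, the indices stats API can already
report how much memory fielddata is holding, per field. A sketch of such a
request:

[source,js]
--------------------------------------------------
GET /_stats/fielddata?fields=*
--------------------------------------------------

The node-level variant, `GET /_nodes/stats/indices/fielddata?fields=*`, breaks
the same numbers down by node, and the `indices.fielddata.cache.size` setting
is the knob that caps how much heap fielddata may occupy.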
-为了使得排序效率更高, Elasticsearch 会在内存中加载你想要以之排序的所有字段的值。 这便是提到的 _字段数据_ 。 - - - - -WARNING: Elasticsearch 并不仅仅加载匹配特定查询的文档的值。 他会加载 _你的数据库中的每个文档_ , 无论这个文档的 `type` - - - - -Elasticsearch在内存中加载所有的值的原因是在硬盘中逆反向索引是很慢的。虽然你当前的请求可能仅仅需要很少文档的值,你仍然可能在下次请求时需要可以访问其他文档的值,所以在内存中立即加载所有的值并驻留是有意义的。 - - - - -字段数据在Elasticsearch中被用于以下地方: - -* 按照字段排序 -* 按照字段聚合 -* 一些特定的筛选(例如,地理筛选) -* 引入字段的脚本 - - -显然的,这会消耗大量的内存,特别是对于高基数的字符串字段--字符串字段有很多独特的值--例如email的body体。幸运的是,内存效率低的问题可以通过增加集群的节点进行水平扩展来解决。 - -现在,所有你需要知道和明确的是它是极度需要内存的。稍后,我们会给你演示如何确定字段数据所占用的内存,如何限制可用的内存,和如何预加载字段数据来提高用户体验。 - - From 8a7efcaa419725a5b6b21a60bf43ec4f17715025 Mon Sep 17 00:00:00 2001 From: Golden Looly Date: Tue, 26 Jul 2016 09:40:26 +0800 Subject: [PATCH 62/95] Revert "chapter24_part6: /270_Fuzzy_matching/60_Phonetic_matching.asciidoc" --- .../60_Phonetic_matching.asciidoc | 92 +++++++++++-------- 1 file changed, 52 insertions(+), 40 deletions(-) diff --git a/270_Fuzzy_matching/60_Phonetic_matching.asciidoc b/270_Fuzzy_matching/60_Phonetic_matching.asciidoc index 67bcb51ae..6e2fd59b6 100644 --- a/270_Fuzzy_matching/60_Phonetic_matching.asciidoc +++ b/270_Fuzzy_matching/60_Phonetic_matching.asciidoc @@ -1,28 +1,35 @@ [[phonetic-matching]] -=== 语音匹配 - -最后,在尝试任何其他匹配方法都无效后,我们可以求助于搜索发音相似的词,即使他们的拼写不同。 - - -存在一些将词转换成语音标识的算法。 -((("phonetic algorithms"))) http://en.wikipedia.org/wiki/Soundex[Soundex] 算法是这些算法的鼻祖, -而且大多数语音算法是 Soundex 的改进或者专业版本,例如 http://en.wikipedia.org/wiki/Metaphone[Metaphone] -和 http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone[Double Metaphone] (扩展了除英语以外的其他语言的语音匹配), -http://en.wikipedia.org/wiki/Caverphone[Caverphone] 算法匹配了新西兰的名称, -https://en.wikipedia.org/wiki/Daitch–Mokotoff_Soundex#Beider.E2.80.93Morse_Phonetic_Name_Matching_Algorithm[Beider-Morse] 算法吸收了 Soundex 算法为了更好的匹配德语和依地语名称, -http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik[Kölner Phonetik] 为了更好的处理德语词汇。 - - -值得一提的是,语音算法是相当简陋的,((("languages", "phonetic algorithms")))他们设计初衷针对的语言通常是英语或德语。这限制了他们的实用性。 -不过,为了某些明确的目标,并与其他技术相结合,语音匹配能够作为一个有用的工具。 - - -首先,你将需要从 -https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html 获取在集群的每个节点安装((("Phonetic Analysis plugin")))语言分析器插件, -并且重启每个节点。 - - -然后,您可以创建一个使用语音语汇单元过滤器的自定义分析器,并尝试下面的方法: +=== Phonetic Matching + +In a last, desperate, attempt to match something, anything, we could resort to +searching for words that sound similar, ((("typoes and misspellings", "phonetic matching")))((("phonetic matching")))even if their spelling differs. + +Several algorithms exist for converting words into a phonetic +representation.((("phonetic algorithms"))) The http://en.wikipedia.org/wiki/Soundex[Soundex] algorithm is +the granddaddy of them all, and most other phonetic algorithms are +improvements or specializations of Soundex, such as +http://en.wikipedia.org/wiki/Metaphone[Metaphone] and +http://en.wikipedia.org/wiki/Metaphone#Double_Metaphone[Double Metaphone] +(which expands phonetic matching to languages other than English), +http://en.wikipedia.org/wiki/Caverphone[Caverphone] for matching names in New +Zealand, the +https://en.wikipedia.org/wiki/Daitch–Mokotoff_Soundex#Beider.E2.80.93Morse_Phonetic_Name_Matching_Algorithm[Beider-Morse] algorithm, which adopts the Soundex algorithm +for better matching of German and Yiddish names, and the +http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik[Kölner Phonetik] for better +handling of German words. 
+ +The thing to take away from this list is that phonetic algorithms are fairly +crude, and ((("languages", "phonetic algorithms")))very specific to the languages they were designed for, usually +either English or German. This limits their usefulness. Still, for certain +purposes, and in combination with other techniques, phonetic matching can be a +useful tool. + +First, you will need to install ((("Phonetic Analysis plugin")))the Phonetic Analysis plug-in from +https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html on every node +in the cluster, and restart each node. + +Then, you can create a custom analyzer that uses one of the +phonetic token filters ((("phonetic matching", "creating a phonetic analyzer")))and try it out: [source,json] ----------------------------------- @@ -46,11 +53,12 @@ PUT /my_index } } ----------------------------------- -<1> 首先,配置一个自定义 `phonetic` 语汇单元过滤器并使用 `double_metaphone` 编码器。 -<2> 然后在自定义分析器中使用自定义语汇单元过滤器。 +<1> First, configure a custom `phonetic` token filter that uses the + `double_metaphone` encoder. +<2> Then use the custom token filter in a custom analyzer. +Now we can test it with the `analyze` API: -现在我们可以通过 `analyze` API 来进行测试: [source,json] ----------------------------------- @@ -58,13 +66,13 @@ GET /my_index/_analyze?analyzer=dbl_metaphone Smith Smythe ----------------------------------- +Each of `Smith` and `Smythe` produce two tokens in the same position: `SM0` +and `XMT`. Running `John`, `Jon`, and `Johnnie` through the analyzer will all +produce the two tokens `JN` and `AN`, while `Jonathon` results in the tokens +`JN0N` and `ANTN`. -每个 `Smith` 和 `Smythe` 在同一位置产生两个语汇单元: `SM0` 和 `XMT` 。 -通过分析器播放 `John` , `Jon` 和 `Johnnie` 将产生两个语汇单元 `JN` 和 `AN` ,而 `Jonathon` 产生语汇单元 `JN0N` 和 `ANTN` 。 - - -语音分析器可以像任何其他分析器一样使用。 首先映射一个字段来使用它,然后索引一些数据: - +The phonetic analyzer can be used just like any other analyzer. First map a +field to use it, and then index some data: [source,json] ----------------------------------- @@ -93,10 +101,9 @@ PUT /my_index/my_type/2 "name": "Jonnie Smythe" } ----------------------------------- -<1> `name.phonetic` 字段使用自定义 `dbl_metaphone` 分析器。 - +<1> The `name.phonetic` field uses the custom `dbl_metaphone` analyzer. -可以使用 `match` 查询来进行搜索: +The `match` query can be used for searching: [source,json] ----------------------------------- @@ -113,10 +120,15 @@ GET /my_index/my_type/_search } ----------------------------------- +This query returns both documents, demonstrating just how coarse phonetic +matching is. ((("phonetic matching", "purpose of"))) Scoring with a phonetic algorithm is pretty much worthless. The +purpose of phonetic matching is not to increase precision, but to increase +recall--to spread the net wide enough to catch any documents that might +possibly match.((("recall", "increasing with phonetic matching"))) + +It usually makes more sense to use phonetic algorithms when retrieving results +which will be consumed and post-processed by another computer, rather than by +human users. 
-这个查询返回全部两个文档,演示了如何进行简陋的语音匹配。 -((("phonetic matching", "purpose of"))) 用语音算法计算评分是没有价值的。 -语音匹配的目的不是为了提高精度,而是要提高召回率--以扩展足够的范围来捕获可能匹配的文档。 -通常是更有意义的使用语音算法是在检索到结果后,由另一台计算机进行消费和后续处理,而不是由人类用户直接使用。 From 87d08dc565f90b5aef7e932a0e3368fd607678f7 Mon Sep 17 00:00:00 2001 From: Golden Looly Date: Tue, 26 Jul 2016 09:41:22 +0800 Subject: [PATCH 63/95] Revert "chapter24_part5: /270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc" --- .../50_Scoring_fuzziness.asciidoc | 40 ++++++++++--------- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc b/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc index 56ed92df3..f45176495 100644 --- a/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc +++ b/270_Fuzzy_matching/50_Scoring_fuzziness.asciidoc @@ -1,29 +1,33 @@ [[fuzzy-scoring]] -=== 模糊性评分 +=== Scoring Fuzziness +Users love fuzzy queries. They assume that these queries will somehow magically find +the right combination of proper spellings.((("fuzzy queries", "scoring fuzziness")))((("typoes and misspellings", "scoring fuzziness")))((("relevance scores", "fuzziness and"))) Unfortunately, the truth is +somewhat more prosaic. -用户喜欢模糊查询。他们认为这种查询会魔法般的找到正确拼写组合。 -((("fuzzy queries", "scoring fuzziness")))((("typoes and misspellings", "scoring fuzziness")))((("relevance scores", "fuzziness and"))) -很遗憾,实际效果平平。 +Imagine that we have 1,000 documents containing ``Schwarzenegger,'' and just +one document with the misspelling ``Schwarzeneger.'' According to the theory +of <>, the misspelling is +much more relevant than the correct spelling, because it appears in far fewer +documents! +In other words, if we were to treat fuzzy matches((("match query", "fuzzy match query"))) like any other match, we +would favor misspellings over correct spellings, which would make for grumpy +users. -假设我们有1000个文档包含 ``Schwarzenegger'' ,只是一个文档的出现拼写错误 ``Schwarzeneger'' 。 -根据 <> 理论,这个拼写错误文档比拼写正确的相关度更高,因为它更少在文档中出现! - - -换句话说,如果我们对待模糊匹配((("match query", "fuzzy match query")))类似其他匹配方法,我们将偏爱错误的拼写超过了正确的拼写,这会让用户发狂。 - - -TIP: 模糊匹配不应用于参与评分--只能在有拼写错误时扩大匹配项的范围。 - - -默认情况下, `match` 查询给定所有的模糊匹配的恒定评分为1。这可以满足在结果列表的末尾添加潜在的匹配记录,并且没有干扰非模糊查询的相关性评分。 +TIP: Fuzzy matching should not be used for scoring purposes--only to widen +the net of matching terms in case there are misspellings. +By default, the `match` query gives all fuzzy matches the constant score of 1. +This is sufficient to add potential matches onto the end of the result list, +without interfering with the relevance scoring of nonfuzzy queries. [TIP] ================================================== -在模糊查询最初出现时很少能单独使用。他们更好的作为一个 ``bigger'' 场景的部分功能特性,如 _search-as-you-type_ -{ref}/search-suggesters-completion.html[`完成` 建议]或 -_did-you-mean_ {ref}/search-suggesters-phrase.html[`短语` 建议]。 +Fuzzy queries alone are much less useful than they initially appear. They are +better used as part of a ``bigger'' feature, such as the _search-as-you-type_ +{ref}/search-suggesters-completion.html[`completion` suggester] or the +_did-you-mean_ {ref}/search-suggesters-phrase.html[`phrase` suggester]. 
+ ================================================== From df1c138c453e76b6f3a2128900ee0216ee2fbf63 Mon Sep 17 00:00:00 2001 From: Medcl Date: Fri, 29 Jul 2016 12:17:09 +0800 Subject: [PATCH 64/95] Revert "chapter8_part3: /056_Sorting/90_What_is_relevance.asciidoc" --- 056_Sorting/90_What_is_relevance.asciidoc | 179 +++++++++++----------- 1 file changed, 92 insertions(+), 87 deletions(-) diff --git a/056_Sorting/90_What_is_relevance.asciidoc b/056_Sorting/90_What_is_relevance.asciidoc index d0c759336..993b29edc 100644 --- a/056_Sorting/90_What_is_relevance.asciidoc +++ b/056_Sorting/90_What_is_relevance.asciidoc @@ -1,59 +1,65 @@ -[[相关性简介]] -=== 什么是相关性? +[[relevance-intro]] +=== What Is Relevance? +We've mentioned that, by default, results are returned in descending order of +relevance.((("relevance", "defined"))) But what is relevance? How is it calculated? +The relevance score of each document is represented by a positive floating-point number called the `_score`.((("score", "calculation of"))) The higher the `_score`, the more relevant +the document. -我们曾经讲过,默认情况下,返回结果是按照相关性倒序排序的,((("relevance", "defined")))但是什么是相关性?相关性如何计算 -? +A query clause generates a `_score` for each document. How that score is +calculated depends on the type of query clause.((("fuzzy queries", "calculation of relevence score"))) Different query clauses are +used for different purposes: a `fuzzy` query might determine the `_score` by +calculating how similar the spelling of the found word is to the original +search term; a `terms` query would incorporate the percentage of terms that +were found. However, what we usually mean by _relevance_ is the algorithm that we +use to calculate how similar the contents of a full-text field are to a full-text query string. +The standard _similarity algorithm_ used in Elasticsearch is((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("similarity algorithms", "Term Frequency/Inverse Document Frequency (TF/IDF)"))) known as _term +frequency/inverse document frequency_, or _TF/IDF_, which takes the following +factors into((("inverse document frequency"))) account: +Term frequency:: -每个文档都会有相关性评分,用一个正浮点数 `_score` 来表示, `_scaore` 的评分越高,相关性越高。 + How often does the term appear in the field? The more often, the more + relevant. A field containing five mentions of the same term is more likely + to be relevant than a field containing just one mention. +Inverse document frequency:: + How often does each term appear in the index? The more often, the _less_ + relevant. Terms that appear in many documents have a lower _weight_ than + more-uncommon terms. -查询子句会为每个文档添加一个 `_score` 字段,评分的计算方式取决于不同的查询类型———不同的查询子句用于不同的查询目的。((("fuzzy queries", "calculation of relevence score"))) 一个 `fuzzy` -查询会计算与关键词的拼写相似程度, `terms` 查询会计算找到的内容于关键词组成部分匹配的百分比,但是一般意义上我们说的全文本搜索是指计算内容与关键词的类似程度。 +Field-length norm:: + How long is the field? The longer it is, the less likely it is that words in + the field will be relevant. A term appearing in a short `title` field + carries more weight than the same term appearing in a long `content` field. -Elasticsearch 的相似度算法被定义为 TF/IDF ,((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("similarity algorithms", "Term Frequency/Inverse Document Frequency (TF/IDF)")))即检索词频率/反向文档频率,包括((("inverse document frequency")))以下内容: +Individual ((("field-length norm")))queries may combine the TF/IDF score with other factors +such as the term proximity in phrase queries, or term similarity in +fuzzy queries. 
+Relevance is not just about full-text search, though. It can equally be applied +to yes/no clauses, where the more clauses that match, the higher the +`_score`. -检索词频率:: - - - 检索词在该字段出现的频率?出现频率越高,相关性越高。字段中出现5次相同的检索词要比只出现一次的相关性高。 - -反向文档频率:: - - 每个检索词在索引中出现的频率?出现的频率越高,相关性也越高。检索词出现在多数文档中的会比出现在少数文档中的权重更低,即检验一个检索词在文档中的普遍重要性。 - -字段长度准则:: - - - 字段的长度是多少?长度越长,相关性越低。检索词出现在一个短的 `title` 要比同样的词出现在一个长的 `content` 字段相关性更高。 - - -单个查询((("field-length norm")))可以使用 TF/IDF 评分标准或其他方式,比如在短语查询中检索词的距离或模糊查询中检索词的相似度。 - - - -虽然如此,相关性不仅仅关于全文搜索,也适用于 yes/no 子句, 匹配的字句越多,相关性评分越高。 - - - -当多条查询子句被合并为一条复合子句时,((("compound query clauses", "relevance score for results"))) 例如 `bool` 查询,则每个查询子句计算得出的得分会被合并到总的相关性评分中。 - - -TIP: 我们有了一整章关于相关性计算和如何使其按照你所希望的方式运作:<>. +When multiple query clauses are combined using a compound query((("compound query clauses", "relevance score for results"))) like the +`bool` query, the `_score` from each of these query clauses is combined to +calculate the overall `_score` for the document. +TIP: We have a whole chapter dedicated to relevance calculations and how to +bend them to your will: <>. [[explain]] -==== 理解评分标准 - +==== Understanding the Score +When debugging a complex query,((("score", "calculation of")))((("relevance scores", "understanding"))) it can be difficult to understand +exactly how a `_score` has been calculated. Elasticsearch +has the option of producing an _explanation_ with every search result, +by setting the `explain` parameter((("explain parameter"))) to `true`. -当调试一个复杂的查询语句时, 想要理解相关性评分会比较困难。Elasticsearch在每个查询语句中都会生成 _explanation_ 选项,将 `explain` 参数设置为 `true` 就可以得到更详细的信息。 [source,js] -------------------------------------------------- @@ -63,19 +69,18 @@ GET /_search?explain <1> } -------------------------------------------------- // SENSE: 056_Sorting/90_Explain.json -<1> `explain` 参数 增加了对每个结果的 `_score` 评分是如何计算出来的。 +<1> The `explain` parameter adds an explanation of how the `_score` was + calculated to every result. [NOTE] ==== - -增加一个 `explain` 参数会为每个匹配到的文档产生一大堆额外内容,但是花时间去理解它是有意义的。如果现在看不明白也没关系———等你需要的时候再来回顾这一节就行/夏眠我们来一点点地了解这块知识点。 - - +Adding `explain` produces a lot((("explain parameter", "for relevance score calculation"))) of output for every hit, which can look +overwhelming, but it is worth taking the time to understand what it all means. +Don't worry if it doesn't all make sense now; you can refer to this section +when you need it. We'll work through the output for one `hit` bit by bit. ==== - -首先,我么看一下普通查询返回的元数据。 - +First, we have the metadata that is returned on normal search requests: [source,js] -------------------------------------------------- @@ -87,10 +92,9 @@ GET /_search?explain <1> "_source" : { ... trimmed ... }, -------------------------------------------------- - - -这里加入了文档来源的分片和节点的信息,这对我们是比较有帮助的,因为词频率和文档频率是在每个分片中计算出来的,而不是每个索引中。 - +It adds information about the shard and the node that the document came from, +which is useful to know because term and document frequencies are calculated +per shard, rather than per index: [source,js] -------------------------------------------------- @@ -98,10 +102,10 @@ GET /_search?explain <1> "_node" : "mzIVYCsqSWCG_M_ZffSs9Q", -------------------------------------------------- - - -然后返回值中的 `_explanation_` 会包含在每一个入口,((("explanation of relevance score calculation")))((("description", "of relevance score calculations")))告诉你采用了哪种计算方式,并让你知道计算结果和我们需要的其他详情。 - +Then it provides the `_explanation`. 
Each ((("explanation of relevance score calculation")))((("description", "of relevance score calculations")))entry contains a `description` +that tells you what type of calculation is being performed, a `value` +that gives you the result of the calculation, and the `details` of any +subcalculations that were required: [source,js] -------------------------------------------------- @@ -137,54 +141,55 @@ GET /_search?explain <1> ] } -------------------------------------------------- -<1> `honeymoon` 相关性评分计算的总结 -<2> 检索词频率 -<3> 反向文档频率 -<4> 字段长度准则 - -WARNING: 输出 `explain` 的代价是昂贵的.((("explain parameter", "overhead of using"))) 它只能用作调试,而不要用于生产环境。 - - -第一部分是关于计算的总结。告诉了我们 文档 `0` 中`honeymoon` 在 `tweet` 字段中的检索词频率/反向文档频率 (TF/IDF)((("weight", "calculation of")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "weight calculation for a term")))。(这里的文档 `0` 是一个内部的ID,跟我们没有任何关系,可以忽略) +<1> Summary of the score calculation for `honeymoon` +<2> Term frequency +<3> Inverse document frequency +<4> Field-length norm +WARNING: Producing the `explain` output is expensive.((("explain parameter", "overhead of using"))) It is a debugging tool +only. Don't leave it turned on in production. -然后给出了计算的权重计算出来的详情((("field-length norm")))((("inverse document frequency"))) 。 +The first part is the summary of the calculation. It tells us that it has +calculated the _weight_—the ((("weight", "calculation of")))((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm", "weight calculation for a term")))TF/IDF--of the term `honeymoon` in the field `tweet`, for document `0`. (This is +an internal document ID and, for our purposes, can be ignored.) +It then provides details((("field-length norm")))((("inverse document frequency"))) of how the weight was calculated: -检索词频率:: +Term frequency:: - 在本文档中检索词 `honeymoon` 在 `tweet` 字段中的出现次数。 + How many times did the term `honeymoon` appear in the `tweet` field in + this document? -反向文档频率:: +Inverse document frequency:: - 在本索引中, 本文档 `honeymoon` 在 `tweet` 字段出现次数和其他文档中出现总数的比率。 - - -字段长度准则:: - - 文档中 `tweet` 字段内容的长度——内容越长,其值越小 - - - -复杂的查询语句的解释也很复杂,但是包含的内容与上面例子大致相同。通过这段描述我们可以了解搜索结果的顺序是如何产生的,这些信息在我们调试时是无价的。 + How many times did the term `honeymoon` appear in the `tweet` field + of all documents in the index? +Field-length norm:: + How long is the `tweet` field in this document? The longer the field, + the smaller this number. +Explanations for more-complicated queries can appear to be very complex, but +really they just contain more of the same calculations that appear in the +preceding example. This information can be invaluable for debugging why search +results appear in the order that they do. [TIP] ================================================================== -json形式的 `explain` 会非常难以阅读, 但是转成yaml会好很多。((("explain parameter", "formatting output in YAML")))((("YAML, formatting explain output in"))) 仅仅需要在查询参数中增加 `format=yaml` 。 +The output from `explain` can be difficult to read in JSON, but it is easier +when it is formatted as YAML.((("explain parameter", "formatting output in YAML")))((("YAML, formatting explain output in"))) Just add `format=yaml` to the query string. 
================================================================== [[explain-api]] -==== 理解文档是如何被匹配到的 - - -当 `explain` 选项加到某一文档上时,他会告诉你为何这个文档会被匹配,以及一个文档为何没有被匹配。((("relevance", "understanding why a document matched")))((("explain API, understanding why a document matched"))) +==== Understanding Why a Document Matched +While the `explain` option adds an explanation for every result, you can use +the `explain` API to understand why one particular document matched or, more +important, why it _didn't_ match.((("relevance", "understanding why a document matched")))((("explain API, understanding why a document matched"))) -请求路径为 `/index/type/id/_explain`, 如下所示: +The path for the request is `/index/type/id/_explain`, as in the following: [source,js] -------------------------------------------------- @@ -200,14 +205,14 @@ GET /us/tweet/12/_explain -------------------------------------------------- // SENSE: 056_Sorting/90_Explain_API.json +Along with the full explanation((("description", "of why a document didn't match"))) that we saw previously, we also now have a +`description` element, which tells us this: -和我们之前看到的全部详情一起,我们现在有了一个 `element` 元素,并告知我们如下 [source,js] -------------------------------------------------- "failure to match filter: cache(user_id:[2 TO 2])" -------------------------------------------------- - - -换句话说,我们的 `user_id` 过滤器子句防止了文档被匹配到 +In other words, our `user_id` filter clause is preventing the document from +matching. From bbf4550f11b0eee394bbb1be289627fde3f6b10d Mon Sep 17 00:00:00 2001 From: Medcl Date: Fri, 29 Jul 2016 12:18:13 +0800 Subject: [PATCH 65/95] Revert "chapter8_part2: /056_Sorting/88_String_sorting.asciidoc" --- 056_Sorting/88_String_sorting.asciidoc | 56 +++++++++++++------------- 1 file changed, 27 insertions(+), 29 deletions(-) diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index 6d322b68e..8f57c4ad8 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -1,30 +1,28 @@ -[[多字段]] -=== 字符串排序与多字段 +[[multi-fields]] +=== String Sorting and Multifields +Analyzed string fields are also multivalue fields,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) but sorting on them seldom +gives you the results you want. If you analyze a string like `fine old art`, +it results in three terms. We probably want to sort alphabetically on the +first term, then the second term, and so forth, but Elasticsearch doesn't have this +information at its disposal at sort time. -被解析的字符串字段也是多值字段,((("strings", "sorting on string fields")))((("analyzed fields", "string fields")))((("sorting", "string sorting and multifields"))) 但是很少会按照你想要的方式进行排序。如果你想分析一个字符串,如 `fine old art` , -这包含3项。我们很坑想要按第一项的字母排序,然后按第二项的字母排序,诸如此类,但是Elasticsearch在排序过程中没有这样的信息。 +You could use the `min` and `max` sort modes (it uses `min` by default), but +that will result in sorting on either `art` or `old`, neither of which was the +intent. +In order to sort on a string field, that field should contain one term only: +the whole `not_analyzed` string.((("not_analyzed string fields", "sorting on"))) But of course we still need the field to be +`analyzed` in order to be able to query it as full text. 
-你可以使用 `min` 和 `max` 排序模式(默认是 `min` ),但是这会导致排序以 `art` 或是 `old` ,任何一个都不是所希望的 - - - -为了以字符串字段进行排序, 这个字段应仅包含一项: -整个 `not_analyzed` 字符串。((("not_analyzed string fields", "sorting on"))) 但是我们仍需要 `analyzed` 字段,这样才能以全文进行查询 - - - -一个简单的方法是用两种方式对同一个字符串进行索引,这将在文档中包括两个字段 : `analyzed` 用于搜索, `not_analyzed` 用于排序 - - - -但是保存相同的字符串两次在 `_source` 字段是浪费空间的。 -我们真正想要做的是传递一个 _单字段_ 但是 却用两种方式索引它。所有的 _core_field 类型 (strings, numbers, Booleans, dates) 接收一个 `字段s` 参数((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping"))) - -该参数允许你转化一个简单的映射如 - +The naive approach to indexing the same string in two ways would be to include +two separate fields in the document: one that is `analyzed` for searching, +and one that is `not_analyzed` for sorting. +But storing the same string twice in the `_source` field is waste of space. +What we really want to do is to pass in a _single field_ but to _index it in two different ways_. All of the _core_ field types (strings, numbers, +Booleans, dates) accept a `fields` parameter ((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping")))that allows you to transform a +simple mapping like [source,js] -------------------------------------------------- @@ -34,7 +32,7 @@ } -------------------------------------------------- -为一个多字段映射如: +into a _multifield_ mapping like this: [source,js] -------------------------------------------------- @@ -51,12 +49,12 @@ -------------------------------------------------- // SENSE: 056_Sorting/88_Multifield.json -<1> `tweet` 主字段与之前的一样: 是一个 `analyzed` 全文字段。 -<2> 新的 `tweet.raw` 子字段是 `not_analyzed`. - - -现在, 至少我们已经重新索引了我们的数据,使用 `tweet` 字段用于搜索,`tweet.raw` 字段用于排序: +<1> The main `tweet` field is just the same as before: an `analyzed` full-text + field. +<2> The new `tweet.raw` subfield is `not_analyzed`. +Now, or at least as soon as we have reindexed our data, we can use the `tweet` +field for search and the `tweet.raw` field for sorting: [source,js] -------------------------------------------------- @@ -72,6 +70,6 @@ GET /_search -------------------------------------------------- // SENSE: 056_Sorting/88_Multifield.json -WARNING: 以全文 `analyzed` 字段排序会消耗大量的内存. See +WARNING: Sorting on a full-text `analyzed` field can use a lot of memory. See <> for more information. From 8eb75821b3f34015cc912ec5ad0a4c97699db593 Mon Sep 17 00:00:00 2001 From: medcl Date: Fri, 29 Jul 2016 12:38:11 +0800 Subject: [PATCH 66/95] fix sorting use old fielddata tag --- 056_Sorting/88_String_sorting.asciidoc | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/056_Sorting/88_String_sorting.asciidoc b/056_Sorting/88_String_sorting.asciidoc index 8f57c4ad8..db220ea1b 100644 --- a/056_Sorting/88_String_sorting.asciidoc +++ b/056_Sorting/88_String_sorting.asciidoc @@ -22,7 +22,7 @@ and one that is `not_analyzed` for sorting. But storing the same string twice in the `_source` field is waste of space. What we really want to do is to pass in a _single field_ but to _index it in two different ways_. 
All of the _core_ field types (strings, numbers, Booleans, dates) accept a `fields` parameter ((("mapping (types)", "transforming simple mapping to multifield mapping")))((("types", "core simple field types", "accepting fields parameter")))((("fields parameter")))((("multifield mapping")))that allows you to transform a -simple mapping like +simple mapping like: [source,js] -------------------------------------------------- @@ -71,5 +71,4 @@ GET /_search // SENSE: 056_Sorting/88_Multifield.json WARNING: Sorting on a full-text `analyzed` field can use a lot of memory. See -<> for more information. - +<> for more information. From 3a1d87c5e4bd62d1f7e683a10639b7d194445593 Mon Sep 17 00:00:00 2001 From: JessicaWon Date: Mon, 1 Aug 2016 02:56:54 -0700 Subject: [PATCH 67/95] 00_Intro is finished --- 060_Distributed_Search/00_Intro.asciidoc | 19 +++++++++++++++++++ .../05_Query_phase.asciidoc | 4 ++++ 2 files changed, 23 insertions(+) diff --git a/060_Distributed_Search/00_Intro.asciidoc b/060_Distributed_Search/00_Intro.asciidoc index a6098a6c5..244247e87 100644 --- a/060_Distributed_Search/00_Intro.asciidoc +++ b/060_Distributed_Search/00_Intro.asciidoc @@ -32,3 +32,22 @@ But finding all matching documents is only half the story. Results from multiple shards must be combined into a single sorted list before the `search` API can return a ``page'' of results. For this reason, search is executed in a two-phase process called _query then fetch_. +[[分布式检索]] +== 分布式检索执行 + +在开始之前,我们先来讨论有关在分布式环境中检索是如何进行的。((("distributed search execution")))比我们之前在<>中讨论过的基础的_create-read-update-delete_ (CRUD)请求的((("CRUD (create-read-update-delete) operations")))较为简单。 + +.内容提示 +**** + +你有兴趣的话可以读一读这章,并不需要为了使用Elasticsearch而理解和记住所有的细节。 + +这章的阅读目的只为在脑海中形成服务运行的梗概以及了解信息的存放位置以便不时之需,但是不要被细节搞的云里雾里。 + +**** + +CRUD的操作处理一个单个的文档,此文档中有一个`_index`, `_type`和<>之间的特殊连接,其中<>的缺省值为`_id`。这意味着我们知道在集群中哪个分片存有此文档。 + +检索需要一个更为精细的模型因为我们不知道哪条文档会被命中:这些文档可能分布在集群的任何分片上。一条检索的请求需要参考我们感兴趣的所有索引中的每个分片复本,这样来确认索引中是否有任何匹配的文档。 + +定位所有的匹配文档仅仅是开始,不同分片的结果在`search`的API返回``page''结果前必须融合到一个单个的已分类列表中。正因为如此,检索执行通常两步走,先是_query,然后是fetch_。 diff --git a/060_Distributed_Search/05_Query_phase.asciidoc b/060_Distributed_Search/05_Query_phase.asciidoc index dde4256bc..01e7af1ea 100644 --- a/060_Distributed_Search/05_Query_phase.asciidoc +++ b/060_Distributed_Search/05_Query_phase.asciidoc @@ -6,12 +6,16 @@ the search locally and ((("priority queue")))builds a _priority queue_ of matchi .Priority Queue **** +== 搜索语句 +在最初阶段_query phase_时,((("distributed search execution", "query phase")))((("query phase of distributed search")))搜索是广播查询索引中的每一个分片复本,不管是主本还是副本。每个分片执行搜索本地,同时((("priority queue")))创建文档命中后的_priority queue_。 + A _priority queue_ is just a sorted list that holds the _top-n_ matching documents. The size of the priority queue depends on the pagination parameters `from` and `size`. 
For example, the following search request would require a priority queue big enough to hold 100 documents: +一个_priority queue_仅仅是一个执行过滤后列表 [source,js] -------------------------------------------------- GET /_search From b390183ad6dde6bc8524985ab6e7b98f3ca08ad4 Mon Sep 17 00:00:00 2001 From: luotitan Date: Mon, 8 Aug 2016 22:21:30 +0800 Subject: [PATCH 68/95] chapter24_part4: /270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc (#134) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 第一次提交 * 第二次修改提交 --- .../40_Fuzzy_match_query.asciidoc | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc b/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc index 2a32ef3a4..8dc1ac471 100644 --- a/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc +++ b/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc @@ -1,7 +1,7 @@ [[fuzzy-match-query]] -=== Fuzzy match Query +=== 模糊匹配查询 -The `match` query supports ((("typoes and misspellings", "fuzzy match query")))((("match query", "fuzzy matching")))((("fuzzy matching", "match query")))fuzzy matching out of the box: +`match` 查询支持((("typoes and misspellings", "fuzzy match query")))((("match query", "fuzzy matching")))((("fuzzy matching", "match query")))开箱即用的模糊匹配: [source,json] ----------------------------------- @@ -19,11 +19,9 @@ GET /my_index/my_type/_search } ----------------------------------- -The query string is first analyzed, to produce the terms `[surprize, me]`, and -then each term is fuzzified using the specified `fuzziness`. +查询字符串首先进行分析,会产生词项 `[surprize, me]` ,并且每个词项根据指定的 `fuzziness` 进行模糊化。 -Similarly, the `multi_match` query also ((("multi_match queries", "fuzziness support")))supports `fuzziness`, but only when -executing with type `best_fields` or `most_fields`: +同样, `multi_match` 查询也((("multi_match queries", "fuzziness support")))支持 `fuzziness` ,但只有当执行查询时类型是 `best_fields` 或者 `most_fields` : [source,json] ----------------------------------- @@ -39,9 +37,6 @@ GET /my_index/my_type/_search } ----------------------------------- -Both the `match` and `multi_match` queries also support the `prefix_length` -and `max_expansions` parameters. - -TIP: Fuzziness works only with the basic `match` and `multi_match` queries. It -doesn't work with phrase matching, common terms, or `cross_fields` matches. 
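As a rough sketch of how those two parameters might be combined with
`fuzziness` in a `match` query (the index, type, and field names below simply
follow the `surprize me` examples used earlier in this chapter):

[source,json]
-----------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query":          "surprize me",
        "fuzziness":      "AUTO",
        "prefix_length":  3, <1>
        "max_expansions": 20 <2>
      }
    }
  }
}
-----------------------------------
<1> Leave the first three characters of each term untouched, so only the rest of the term is fuzzified.
<2> Cap the number of variations that each fuzzified term is allowed to expand into.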
+`match` 和 `multi_match` 查询都支持 `prefix_length` 和 `max_expansions` 参数。 +TIP: 模糊性(Fuzziness)只能在 `match` and `multi_match` 查询中使用。不能使用在短语匹配、常用词项或 `cross_fields` 匹配。 From d867e48b5854d6b50af5016f82bb9d6581944cf0 Mon Sep 17 00:00:00 2001 From: luotitan Date: Mon, 8 Aug 2016 22:24:02 +0800 Subject: [PATCH 69/95] chapter41_part3: /400_Relationships/20_Denormalization.asciidoc (#67) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 15_Application_joins.asciidoc * 第一次提交 --- 400_Relationships/20_Denormalization.asciidoc | 22 ++++++------------- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/400_Relationships/20_Denormalization.asciidoc b/400_Relationships/20_Denormalization.asciidoc index 9b72605f5..2a39c4e91 100644 --- a/400_Relationships/20_Denormalization.asciidoc +++ b/400_Relationships/20_Denormalization.asciidoc @@ -1,15 +1,11 @@ [[denormalization]] -=== Denormalizing Your Data +=== 非规范化你的数据 -The way to get the best search performance out of Elasticsearch is to use it -as it is intended, by((("relationships", "denormalizing your data")))((("denormalization", "denormalizing data at index time"))) -http://en.wikipedia.org/wiki/Denormalization[denormalizing] your data at index -time. Having redundant copies of data in each document that requires access to -it removes the need for joins. -If we want to be able to find a blog post by the name of the user who wrote it, -include the user's name in the blog-post document itself: +使用 Elasticsearch 得到最好的搜索性能的方法是有目的的通过在索引时进行非规范化 ((("relationships", "denormalizing your data")))((("denormalization", "denormalizing data at index time"))) +http://en.wikipedia.org/wiki/Denormalization[denormalizing]。对每个文档保持一定数量的冗余副本可以在需要访问时避免进行关联。 +如果我们希望能够通过某个用户姓名找到他写的博客文章,可以在博客文档中包含这个用户的姓名: [source,json] -------------------------------- @@ -30,10 +26,9 @@ PUT /my_index/blogpost/2 } } -------------------------------- -<1> Part of the user's data has been denormalized into the `blogpost` document. +<1> 这部分用户的字段数据已被冗余到 `blogpost` 文档中。 -Now, we can find blog posts about `relationships` by users called `John` -with a single query: +现在,我们通过单次查询就能够通过 `relationships` 找到用户 `John` 的博客文章。 [source,json] -------------------------------- @@ -50,7 +45,4 @@ GET /my_index/blogpost/_search } -------------------------------- -The advantage of data denormalization is speed. Because each document -contains all of the information that is required to determine whether it -matches the query, there is no need for expensive joins. - +数据非规范化的优点是速度快。因为每个文档都包含了所需的所有信息,当这些信息需要在查询进行匹配时,并不需要进行昂贵的联接操作。 From c283d6758af22dfe5db416012d9a2d4330daaf0d Mon Sep 17 00:00:00 2001 From: Richard Date: Mon, 8 Aug 2016 22:27:07 +0800 Subject: [PATCH 70/95] chapter16_part2: /130_Partial_Matching/05_Postcodes.asciidoc (#102) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 初译 --- 130_Partial_Matching/05_Postcodes.asciidoc | 25 ++++++++++------------ 1 file changed, 11 insertions(+), 14 deletions(-) diff --git a/130_Partial_Matching/05_Postcodes.asciidoc b/130_Partial_Matching/05_Postcodes.asciidoc index d3a47a907..18b27e22f 100644 --- a/130_Partial_Matching/05_Postcodes.asciidoc +++ b/130_Partial_Matching/05_Postcodes.asciidoc @@ -1,22 +1,19 @@ -=== Postcodes and Structured Data +=== 邮编与结构化数据 -We will use United Kingdom postcodes (postal codes in the United States) to illustrate how((("partial matching", "postcodes and structured data"))) to use partial matching with -structured data. 
UK postcodes have a well-defined structure. For instance, the -postcode `W1V 3DG` can((("postcodes (UK), partial matching with"))) be broken down as follows: +我们会使用美国目前使用的邮编形式(United Kingdom postcodes 标准)来说明如何用部分匹配查询结构化数据。((("partial matching", "postcodes and structured data")))这种邮编形式有很好的结构定义。例如,邮编 `W1V 3DG` 可以分解成如下形式:((("postcodes (UK), partial matching with"))) -* `W1V`: This outer part identifies the postal area and district: +* `W1V` :这是邮编的外部,它定义了邮件的区域和行政区: -** `W` indicates the area (one or two letters) -** `1V` indicates the district (one or two numbers, possibly followed by a letter) +** `W` 代表区域( 1 或 2 个字母) +** `1V` 代表行政区( 1 或 2 个数字,可能跟着一个字符) -* `3DG`: This inner part identifies a street or building: +* `3DG` :内部定义了街道或建筑: -** `3` indicates the sector (one number) -** `DG` indicates the unit (two letters) +** `3` 代表街区区块( 1 个数字) +** `DG` 代表单元( 2 个字母) -Let's assume that we are indexing postcodes as exact-value `not_analyzed` -fields, so we could create our index as follows: +假设将邮编作为 `not_analyzed` 的精确值字段索引,所以可以为其创建索引,如下: [source,js] -------------------------------------------------- @@ -36,7 +33,7 @@ PUT /my_index -------------------------------------------------- // SENSE: 130_Partial_Matching/10_Prefix_query.json -And index some ((("indexing", "postcodes")))postcodes: +然后索引一些邮编:((("indexing", "postcodes"))) [source,js] -------------------------------------------------- @@ -57,4 +54,4 @@ PUT /my_index/address/5 -------------------------------------------------- // SENSE: 130_Partial_Matching/10_Prefix_query.json -Now our data is ready to be queried. +现在这些数据已可查询。 From 91aa123eca519059e407a218e28d55dcc1352499 Mon Sep 17 00:00:00 2001 From: "feng.wei" Date: Thu, 1 Sep 2016 14:02:11 +0800 Subject: [PATCH 71/95] translate chapter/chapter05_part1 --- 050_Search/05_Empty_search.asciidoc | 54 +++--------- 050_Search/10_Multi_index_multi_type.asciidoc | 40 ++++----- 050_Search/15_Pagination.asciidoc | 41 +++------- 050_Search/20_Query_string.asciidoc | 82 ++++++------------- 4 files changed, 62 insertions(+), 155 deletions(-) diff --git a/050_Search/05_Empty_search.asciidoc b/050_Search/05_Empty_search.asciidoc index 25cb69a86..a12f507d1 100644 --- a/050_Search/05_Empty_search.asciidoc +++ b/050_Search/05_Empty_search.asciidoc @@ -1,17 +1,14 @@ [[empty-search]] === The Empty Search -The most basic form of the((("searching", "empty search")))((("empty search"))) search API is the _empty search_, which doesn't -specify any query but simply returns all documents in all indices in the -cluster: +搜索API的最基础的形式是没有指定任何查询的空搜索,它简单地返回集群中所有目录中的所有文档: [source,js] -------------------------------------------------- GET /_search -------------------------------------------------- -// SENSE: 050_Search/05_Empty_search.json -The response (edited for brevity) looks something like this: +返回的结果(为了解决编辑过的)像这种这样子: [source,js] -------------------------------------------------- @@ -48,66 +45,39 @@ The response (edited for brevity) looks something like this: ==== hits -The most important section of the response is `hits`, which((("searching", "empty search", "hits")))((("hits"))) contains the -`total` number of documents that matched our query, and a `hits` array -containing the first 10 of those matching documents--the results. +返回结果中最重的部分是 `hits` ,它包含与我们查询相匹配的文档总数 `total` ,并且一个 `hits` 数组包含所查询结果的前十个文档。 -Each result in the `hits` array contains the `_index`, `_type`, and `_id` of -the document, plus the `_source` field. This means that the whole document is -immediately available to us directly from the search results. 
This is unlike -other search engines, which return just the document ID, requiring you to fetch -the document itself in a separate step. +在 `hits` 数组中每个结果包含文档的 `_index` 、 `_type` 、 `_id` ,加上 `_source` 字段。这意味着我们可以直接从返回的搜索结果中使用整个文档。这不像其他的搜索引擎,仅仅返回文档的ID,获取对应的文档需要在单独的步骤。 -Each element also ((("score", "for empty search")))((("relevance scores")))has a `_score`. This is the _relevance score_, which is a -measure of how well the document matches the query. By default, results are -returned with the most relevant documents first; that is, in descending order -of `_score`. In this case, we didn't specify any query, so all documents are -equally relevant, hence the neutral `_score` of `1` for all results. +每个结果还有一个 `_score` ,这是衡量文档与查询匹配度的关联性分数。默认情况下,首先返回最相关的文档结果,就是说,返回的文档是按照 `_score` 降序排列的。在这个例子中,我们没有指定任何查询,故所有的文档具有相同的相关性,因此对所有的结果而言 `1` 是中性的 `_score` 。 -The `max_score` value is the highest `_score` of any document that matches our -query.((("max_score value"))) +`max_score` 值是与查询所匹配文档的最高 `_score` 。 ==== took -The `took` value((("took value (empty search)"))) tells us how many milliseconds the entire search request took -to execute. +`took` 值告诉我们执行整个搜索请求耗费了多少毫秒。 ==== shards -The `_shards` element((("shards", "number involved in an empty search"))) tells us the `total` number of shards that were involved -in the query and,((("failed shards (in a search)")))((("successful shards (in a search)"))) of them, how many were `successful` and how many `failed`. -We wouldn't normally expect shards to fail, but it can happen. If we were to -suffer a major disaster in which we lost both the primary and the replica copy -of the same shard, there would be no copies of that shard available to respond -to search requests. In this case, Elasticsearch would report the shard as -`failed`, but continue to return results from the remaining shards. +`_shards` 部分告诉我们在查询中参与分片的总数,以及这些分片成功了多少个失败了多少个。正常情况下我们不希望分片失败,但是分片失败是可能发生的。如果我们遭遇到一种较常见的灾难,在这个灾难中丢失了相同分片的原始数据和副本,那么对这个分片将没有可用副本来对搜索请求作出响应。假若这样,Elasticsearch 将报告这个分片是失败的,但是会继续返回剩余分片的结果。 ==== timeout -The `timed_out` value tells((("timed_out value in search results"))) us whether the query timed out. By -default, search requests do not time out.((("timeout parameter", "specifying in a request"))) If low response times are more -important to you than complete results, you can specify a `timeout` as `10` -or `10ms` (10 milliseconds), or `1s` (1 second): +`timed_out` 值告诉我们查询是否超时。默认情况下,搜索请求不会超时。如果低响应时间比完成结果更重要,你可以指定 `timeout` 为10或者10ms(10毫秒),或者1s(1秒): [source,js] -------------------------------------------------- GET /_search?timeout=10ms -------------------------------------------------- - -Elasticsearch will return any results that it has managed to gather from -each shard before the requests timed out. +在请求超时之前,Elasticsearch 将返回从每个分片聚集来的结果。 [WARNING] ================================================ -It should be noted that this `timeout` does not((("timeout parameter", "not halting query execution"))) halt the execution of the -query; it merely tells the coordinating node to return the results collected -_so far_ and to close the connection. In the background, other shards may -still be processing the query even though results have been sent. +应当注意的是 `timeout` 不是停止执行查询,它仅仅是告知正在协调的节点返回到目前为止收集的结果并且关闭连接。在后台,其他的分片可能仍在执行查询即使是结果已经被发送了。 -Use the time-out because it is important to your SLA, not because you want -to abort the execution of long-running queries. 
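If you prefer, the same limit can be set in the search request body instead of
the query string; a minimal sketch, with `match_all` standing in for whatever
query you are actually running:

[source,js]
--------------------------------------------------
GET /_search
{
    "timeout": "10ms",
    "query": {
        "match_all": {}
    }
}
--------------------------------------------------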
+使用超时是因为它对你的 SLA 很重要,而不是因为你想去中止长时间运行的查询。

================================================

diff --git a/050_Search/10_Multi_index_multi_type.asciidoc b/050_Search/10_Multi_index_multi_type.asciidoc
index d865bff0d..8024251fd 100644
--- a/050_Search/10_Multi_index_multi_type.asciidoc
+++ b/050_Search/10_Multi_index_multi_type.asciidoc
@@ -1,54 +1,42 @@
 [[multi-index-multi-type]]
-=== Multi-index, Multitype
+=== 多索引,多类型

-Did you notice that the results from the preceding <>
-contained documents ((("searching", "multi-index, multi-type search")))of different types—`user` and `tweet`—from two
-different indices—`us` and `gb`?
+你有没有注意到之前的 <> 的结果包含了来自两个不同索引( `us` 和 `gb` )、两种不同类型( `user` 和 `tweet` )的文档?

-By not limiting our search to a particular index or type, we have searched
-across _all_ documents in the cluster. Elasticsearch forwarded the search
-request in parallel to a primary or replica of every shard in the cluster,
-gathered the results to select the overall top 10, and returned them to us.
+如果不对搜索加上特定索引或者类型的限制,就会搜索集群中的 _所有_ 文档。Elasticsearch 把搜索请求并行转发到集群中每个分片的主分片或者副本分片上,汇集结果以挑选出总体排名前 10 的结果,并且返回给我们。

-Usually, however, you will((("types", "specifying in search requests")))((("indices", "specifying in search requests"))) want to search within one or more specific indices,
-and probably one or more specific types. We can do this by specifying the
-index and type in the URL, as follows:
+然而,通常情况下,你想在一个或多个特定的索引,也可能是一个或者多个特定的类型中进行搜索。我们可以通过在 URL 中指定索引和类型来做到这一点,如下所示:

 `/_search`::
-    Search all types in all indices
+    在所有的索引中搜索所有的类型

 `/gb/_search`::
-    Search all types in the `gb` index
+    在 `gb` 索引中搜索所有的类型

 `/gb,us/_search`::
-    Search all types in the `gb` and `us` indices
+    在 `gb` 和 `us` 索引中搜索所有的类型

 `/g*,u*/_search`::
-    Search all types in any indices beginning with `g` or beginning with `u`
+    在任何以 `g` 或者 `u` 开头的索引中搜索所有的类型

 `/gb/user/_search`::
-    Search type `user` in the `gb` index
+    在 `gb` 索引中搜索 `user` 类型

 `/gb,us/user,tweet/_search`::
-    Search types `user` and `tweet` in the `gb` and `us` indices
+    在 `gb` 和 `us` 索引中搜索 `user` 和 `tweet` 类型

 `/_all/user,tweet/_search`::
-    Search types `user` and `tweet` in all indices
+    在所有的索引中搜索 `user` 和 `tweet` 类型

-When you search within a single index, Elasticsearch forwards the search
-request to a primary or replica of every shard in that index, and then gathers the
-results from each shard. Searching within multiple indices works in exactly
-the same way--there are just more shards involved.
+当在单一索引中搜索的时候,Elasticsearch 把搜索请求转发到这个索引每个分片的主分片或者副本分片上,然后收集每个分片返回的结果。在多个索引中搜索的工作方式完全一致--只是会涉及更多的分片。

 [TIP]
 ================================================
-Searching one index that has five primary shards is _exactly equivalent_ to
-searching five indices that have one primary shard each.
+搜索一个有五个主分片的索引与搜索五个各有一个主分片的索引,准确来说是等价的。
 ================================================

-Later, you will see how this simple fact makes it easy to scale flexibly
-as your requirements change.
+最后,你将明白这种简单的方式如何弹性的把请求的变化变得简单化。

diff --git a/050_Search/15_Pagination.asciidoc b/050_Search/15_Pagination.asciidoc
index 6123cf73b..8a8511bae 100644
--- a/050_Search/15_Pagination.asciidoc
+++ b/050_Search/15_Pagination.asciidoc
@@ -1,21 +1,17 @@
 [[pagination]]
-=== Pagination
+=== 分页

-Our preceding <> told us that 14 documents in the((("pagination")))
-cluster match our (empty) query. But there were only 10 documents in
-the `hits` array. How can we see the other documents?
+之前的 <> 告诉我们集群中有 14 个文档匹配了我们的(空)查询。但是在 `hits` 数组中只有 10 个文档,我们怎样才能看到其他的文档呢? 

-In the same way as SQL uses the `LIMIT` keyword to return a single ``page'' of
-results, Elasticsearch accepts ((("from parameter")))((("size parameter")))the `from` and `size` parameters:
+像 SQL 使用 `LIMIT` 关键字返回单页的结果一样,Elasticsearch 接受 `from` 和 `size` 参数:

 `size`::

-  Indicates the number of results that should be returned, defaults to `10`
+  表示应该返回的结果数量,默认是 `10`

 `from`::

-  Indicates the number of initial results that should be skipped, defaults to `0`
+  表示应该跳过的初始结果数量,默认是 `0`

-If you wanted to show five results per page, then pages 1 to 3
-could be requested as follows:
+如果每页展示 5 条结果,那么第 1 页到第 3 页可以用下面的方式请求:


[source,js]
--------------------------------------------------
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
--------------------------------------------------
// SENSE: 050_Search/15_Pagination.json


-Beware of paging too deep or requesting too many results at once. Results are
-sorted before being returned. But remember that a search request usually spans
-multiple shards. Each shard generates its own sorted results, which then need
-to be sorted centrally to ensure that the overall order is correct.
+要当心分页太深或者一次请求太多的结果。结果在返回前会先被排序。但是请记住,一个搜索请求通常跨越多个分片,每个分片都会产生自己的排序结果,这些结果需要进行集中排序以保证整体的顺序是正确的。

-.Deep Paging in Distributed Systems
+.在分布式系统中深度分页
 ****
-To understand why ((("deep paging, problems with")))deep paging is problematic, let's imagine that we are
-searching within a single index with five primary shards. When we request the
-first page of results (results 1 to 10), each shard produces its own top 10
-results and returns them to the _coordinating node_, which then sorts all 50
-results in order to select the overall top 10.
+为了理解为什么((("deep paging, problems with")))深度分页是有问题的,我们可以设想在一个有五个主分片的索引中搜索。当我们请求结果的第一页(结果 1 到 10 )时,每一个分片产生自己的前 10 个结果,并且返回给 _协调节点_ ,协调节点再对这 50 个结果排序,以选出整体的前 10 个。

-Now imagine that we ask for page 1,000--results 10,001 to 10,010. Everything
-works in the same way except that each shard has to produce its top 10,010
-results. The coordinating node then sorts through all 50,050 results and
-discards 50,040 of them!
+现在设想我们请求第 1000 页--结果 10001 到 10010 。一切都以相同的方式工作,只是每个分片不得不产生它的前 10010 个结果。然后协调节点对全部 50050 个结果排序,最后丢弃掉其中的 50040 个!

-You can see that, in a distributed system, the cost of sorting results
-grows exponentially the deeper we page. There is a good reason
-that web search engines don't return more than 1,000 results for any query.
+可以看到,在分布式系统中,对结果排序的成本随分页深度的增加而呈指数上升。这也是 web 搜索引擎对任何查询都不会返回超过 1000 条结果的一个很好的理由。

****

-TIP: In <> we explain how you _can_ retrieve large numbers of
-documents efficiently.
+TIP: 在 <> 中我们解释了如何才能高效地获取大量的文档。

diff --git a/050_Search/20_Query_string.asciidoc b/050_Search/20_Query_string.asciidoc
index f4340dab8..c89841bd5 100644
--- a/050_Search/20_Query_string.asciidoc
+++ b/050_Search/20_Query_string.asciidoc
@@ -1,14 +1,9 @@
 [[search-lite]]
 === Search _Lite_

-There are two forms of the `search` API: a ``lite'' _query-string_ version
-that expects all its((("searching", "query string searches")))((("query strings", "searching with"))) parameters to be passed in the query string, and the full
-_request body_ version that expects a JSON request body and uses a
-rich search language called the query DSL.
+搜索 API 有两种形式:一种是精简的 _查询字符串(query-string)_ 版本,要求在查询字符串中传递所有的参数;另一种是功能完整的 _request body_ 版本,要求使用 JSON 格式并且使用一种名叫查询 DSL 的更丰富的搜索语言。

-The query-string search is useful for running ad hoc queries from the
-command line. 
For instance, this query finds all documents of type `tweet` that -contain the word `elasticsearch` in the `tweet` field: +在命令行中查询-字符串搜索对运行特殊的查询是有益的。例如,查询在 `tweet` 类型中 `tweet` 字段包含 `elasticsearch` 单词的所有文档: [source,js] -------------------------------------------------- @@ -16,13 +11,11 @@ GET /_all/tweet/_search?q=tweet:elasticsearch -------------------------------------------------- // SENSE: 050_Search/20_Query_string.json -The next query looks for `john` in the `name` field and `mary` in the -`tweet` field. The actual query is just +下一个查询在 `name` 字段中包含 `john` 并且在 `tweet` 字段中包含 `mary` 的文档。实际的查询就是这样 +name:john +tweet:mary -but the _percent encoding_ needed for query-string parameters makes it appear -more cryptic than it really is: +但是查询-字符串参数所需要的百分比编码让它比实际上的更含义模糊: [source,js] -------------------------------------------------- @@ -31,15 +24,12 @@ GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary // SENSE: 050_Search/20_Query_string.json -The `+` prefix indicates conditions that _must_ be satisfied for our query to -match. Similarly a `-` prefix would indicate conditions that _must not_ -match. All conditions without a `+` or `-` are optional--the more that match, -the more relevant the document. +`+` 前缀表示必须与查询条件匹配。类似地, `-` 前缀表示一定不与查询条件匹配。没有 `+` 或者 `-` 的所有条件是可选的--匹配的越多,文档就越相关。 [[all-field-intro]] ==== The _all Field -This simple search returns all documents that contain the word `mary`: +这个简单搜索返回包含 `mary` 的所有文档: [source,js] -------------------------------------------------- @@ -48,19 +38,15 @@ GET /_search?q=mary // SENSE: 050_Search/20_All_field.json -In the previous examples, we searched for words in the `tweet` or -`name` fields. However, the results from this query mention `mary` in -three fields: +之前的例子中,我们在 `tweet` 和 `name` 字段中搜索内容。然而,这个查询的结果在三个地方提到了 `mary` : * A user whose name is Mary * Six tweets by Mary * One tweet directed at @mary -How has Elasticsearch managed to find results in three different fields? +Elasticsearch 是如何在三个不同的区域中查找到结果的呢? -When you index a document, Elasticsearch takes the string values of all of -its fields and concatenates them into one big string, which it indexes as -the special `_all` field.((("_all field", sortas="all field"))) For example, when we index this document: +当你索引一个文档的时候,Elasticsearch 取出所有字段的值拼接成一个大的字符串,作为 `_all` 字段进行索引。例如,当我们索引这个文档时: [source,js] -------------------------------------------------- @@ -73,7 +59,7 @@ the special `_all` field.((("_all field", sortas="all field"))) For example, whe -------------------------------------------------- -it's as if we had added an extra field called `_all` with this value: +这就好似增加了一个名叫 `_all` 的额外字段: [source,js] -------------------------------------------------- @@ -81,24 +67,19 @@ it's as if we had added an extra field called `_all` with this value: -------------------------------------------------- -The query-string search uses the `_all` field unless another -field name has been specified. +除非字段已经被指定,否则就使用 `_all` 字段进行搜索。 -TIP: The `_all` field is a useful feature while you are getting started with -a new application. Later, you will find that you have more control over -your search results if you query specific fields instead of the `_all` -field. When the `_all` field is no longer useful to you, you can -disable it, as explained in <>. 
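Disabling it is a small mapping change. As a minimal sketch (the index and type
names here are only placeholders, and <> covers the details and trade-offs):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "mappings": {
        "my_type": {
            "_all": { "enabled": false }
        }
    }
}
--------------------------------------------------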
+TIP: 在你刚开始使用 Elasticsearch 的时候, `_all` 字段是一个很实用的特征。之后,你会发现如果你在搜索的时候用指定字段来代替 `_all` 字段,对搜索出来的结果将有更好的控制。当 `_all` 字段对你不再有用的时候,你可以将它置为失效,向在 <> 中解释的。 [[query-string-query]] [role="pagebreak-before"] -==== More Complicated Queries +==== 更复杂的查询 -The next query searches for tweets, using the following criteria: +下面对tweents的查询,使用以下的条件: -* The `name` field contains `mary` or `john` -* The `date` is greater than `2014-09-10` -* The +_all+ field contains either of the words `aggregations` or `geo` +* `name` 字段中包含 `mary` 或者 `john` +* `date` 值大于 `2014-09-10` +* +_all_+ 字段包含 `aggregations` 或者 `geo` [source,js] -------------------------------------------------- @@ -106,39 +87,24 @@ The next query searches for tweets, using the following criteria: -------------------------------------------------- // SENSE: 050_Search/20_All_field.json -As a properly encoded query string, this looks like the slightly less -readable result: +适当编码过的查询字符串看起来有点晦涩难读: [source,js] -------------------------------------------------- ?q=%2Bname%3A(mary+john)+%2Bdate%3A%3E2014-09-10+%2B(aggregations+geo) -------------------------------------------------- -As you can see from the preceding examples, this _lite_ query-string search is -surprisingly powerful.((("query strings", "syntax, reference for"))) Its query syntax, which is explained in detail in the -{ref}/query-dsl-query-string-query.html#query-string-syntax[Query String Syntax] -reference docs, allows us to express quite complex queries succinctly. This -makes it great for throwaway queries from the command line or during -development. +从之前的例子中可以看出,这种简化的查询-字符串的效果是非常惊人的。在相关参考文档中做出了详细解释的查询语法,让我们可以简洁的表达很复杂的查询。这对于命令行随机查询和在开发阶段都是很好的。 -However, you can also see that its terseness can make it cryptic and -difficult to debug. And it's fragile--a slight syntax error in the query -string, such as a misplaced `-`, `:`, `/`, or `"`, and it will return an error -instead of results. +然而,这种简洁的方式可能让排错变得模糊和困难。像 `-` , `:` , `/` 或者 `"` 不匹配这种易错的小语法问题将返回一个错误。 -Finally, the query-string search allows any user to run potentially slow, heavy -queries on any field in your index, possibly exposing private information or -even bringing your cluster to its knees! +最后,这种查询-字符串搜索可能在索引的任何字段中运行的非常缓慢、沉重,也有可能暴露私密信息甚至将集群至于危险之中。 [TIP] ================================================== -For these reasons, we don't recommend exposing query-string searches directly to -your users, unless they are power users who can be trusted with your data and -with your cluster. +因为这些原因,我们不推荐直接向用户暴露查询-字符串,除非这些用户对于集群和数据是可以被信任的。 + ================================================== -Instead, in production we usually rely on the full-featured _request body_ -search API, which does all of this, plus a lot more. Before we get there, -though, we first need to take a look at how our data is indexed in -Elasticsearch. 
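For comparison, the `tweet:elasticsearch` query-string example from the start
of this section maps onto a request body search along these lines. This is just
a sketch; the query DSL itself is the subject of later chapters:

[source,js]
--------------------------------------------------
GET /_all/tweet/_search
{
    "query": {
        "match": { "tweet": "elasticsearch" }
    }
}
--------------------------------------------------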
+相反,我们经常在产品中更多的使用功能全面的 _request body_ 查询API。然而,在我们达到那种程度之前,我们首先需要了解数据在 Elasticsearch 中是如何索引的。 From 9cd3ac12f4d372a43447f72678e650bbed200a35 Mon Sep 17 00:00:00 2001 From: "feng.wei" Date: Thu, 1 Sep 2016 14:50:16 +0800 Subject: [PATCH 72/95] modify --- 050_Search/20_Query_string.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/050_Search/20_Query_string.asciidoc b/050_Search/20_Query_string.asciidoc index c89841bd5..813de7a1d 100644 --- a/050_Search/20_Query_string.asciidoc +++ b/050_Search/20_Query_string.asciidoc @@ -98,7 +98,7 @@ TIP: 在你刚开始使用 Elasticsearch 的时候, `_all` 字段是一个很 然而,这种简洁的方式可能让排错变得模糊和困难。像 `-` , `:` , `/` 或者 `"` 不匹配这种易错的小语法问题将返回一个错误。 -最后,这种查询-字符串搜索可能在索引的任何字段中运行的非常缓慢、沉重,也有可能暴露私密信息甚至将集群至于危险之中。 +字符串查询允许任何用户在索引的任意字段上运行既慢又重的查询,这些查询可能会暴露隐私信息或者将你的集群拖垮。 [TIP] ================================================== From 925873f9f452a72a9d14089bd578d6d18f4fb228 Mon Sep 17 00:00:00 2001 From: JessicaWon <476556993@qq.com> Date: Wed, 7 Sep 2016 20:21:06 -0700 Subject: [PATCH 73/95] 060_Distributed_Search --- .../05_Query_phase.asciidoc | 67 +++++-------------- 1 file changed, 17 insertions(+), 50 deletions(-) diff --git a/060_Distributed_Search/05_Query_phase.asciidoc b/060_Distributed_Search/05_Query_phase.asciidoc index 01e7af1ea..fe63293b1 100644 --- a/060_Distributed_Search/05_Query_phase.asciidoc +++ b/060_Distributed_Search/05_Query_phase.asciidoc @@ -1,21 +1,10 @@ -=== Query Phase +=== 搜索阶段 +在最初阶段 _query phase_ 时, ((("distributed search execution", "query phase"))) ((("query phase of distributed search"))) 搜索是广播查询索引中的每一个分片复本,不管是主本还是副本。每个分片执行本地查询,同时 ((("priority queue"))) 创建文档命中后的 _priority queue_ 。 -During the initial _query phase_, the((("distributed search execution", "query phase")))((("query phase of distributed search"))) query is broadcast to a shard copy (a -primary or replica shard) of every shard in the index. Each shard executes -the search locally and ((("priority queue")))builds a _priority queue_ of matching documents. - -.Priority Queue +.优先队列 **** -== 搜索语句 -在最初阶段_query phase_时,((("distributed search execution", "query phase")))((("query phase of distributed search")))搜索是广播查询索引中的每一个分片复本,不管是主本还是副本。每个分片执行搜索本地,同时((("priority queue")))创建文档命中后的_priority queue_。 - - -A _priority queue_ is just a sorted list that holds the _top-n_ matching -documents. The size of the priority queue depends on the pagination -parameters `from` and `size`. For example, the following search request -would require a priority queue big enough to hold 100 documents: +_priority queue_ 仅仅是一个含有命中文档的 _top-n_ 过滤后列表。优先队列的大小取决于分页参数 `from` 和 `size` 。例如,如下搜索请求将需要足够大的优先队列来放入100条文档。 -一个_priority queue_仅仅是一个执行过滤后列表 [source,js] -------------------------------------------------- GET /_search @@ -26,52 +15,30 @@ GET /_search -------------------------------------------------- **** -The query phase process is depicted in <>. +查询过程在 <> 中有描述。 [[img-distrib-search]] -.Query phase of distributed search -image::images/elas_0901.png["Query phase of distributed search"] +.Query phase of distributed s +.查询过程分布式搜索 +image::images/elas_0901.png["查询过程分布式搜索"] -The query phase consists of the following three steps: +查询过程包含以下几个步骤: -1. The client sends a `search` request to `Node 3`, which creates an empty - priority queue of size `from + size`. +1. 客户端发送 `search` 请求到 `Node 3`,会差生一个大小为 `from + size` 的空优先队列。 -2. `Node 3` forwards the search request to a primary or replica copy of every - shard in the index. 
Each shard executes the query locally and adds the - results into a local sorted priority queue of size `from + size`. +2. `Node 3` 将查询请求前转到每个索引的每个分片中的主本或复本去。每个分片执行本地查询并添加结果到大小为 `from + size` 的本地优先队列中。 -3. Each shard returns the doc IDs and sort values of all the docs in its - priority queue to the coordinating node, `Node 3`, which merges these - values into its own priority queue to produce a globally sorted list of - results. +3. 每个分片返回文档的IDs并且将所有优先队列中文档归类到对应的节点, `Node 3` 合并这些值到其优先队列中来产生一个全局排序后的列表。 -When a search request is sent to a node, that node becomes the coordinating -node.((("nodes", "coordinating node for search requests"))) It is the job of this node to broadcast the search request to all -involved shards, and to gather their responses into a globally sorted result -set that it can return to the client. +当查询请求到达节点的时候,节点变成了并列节点。 ((("nodes", "coordinating node for search requests"))) 这个节点任务是广播查询请求到所有相关节点并收集其他节点的返回状态存入全局排序后的集合,状态最终可以返回到客户端。 -The first step is to broadcast the request to a shard copy of every node in -the index. Just like <>, search requests -can be handled by a primary shard or by any of its replicas.((("shards", "handling search requests"))) This is how more -replicas (when combined with more hardware) can increase search throughput. -A coordinating node will round-robin through all shard copies on subsequent -requests in order to spread the load. +第一步是广播请求到索引中的每个几点钟一个分片复本去。就像 <> 查询请求可以被某个主分片或其副本处理, ((("shards", "handling search requests"))) 则是在结合硬件的时候处理多个复本如何增加查询吞吐率。一个并列节点将在之后的请求中轮询所有的分片复本来分散负载。 -Each shard executes the query locally and builds a sorted priority queue of -length `from + size`—in other words, enough results to satisfy the global -search request all by itself. It returns a lightweight list of results to the -coordinating node, which contains just the doc IDs and any values required for -sorting, such as the `_score`. +每个分片在本地执行查询请求并且创建一个长度为 `from + size`— 的优先队列;换句话说,它自己的查询结果来满足全局查询请求,它返回一个轻量级的结果列表到并列节点上,其中并列节点仅包含文档IDs和排序的任何值,比如 `_score` 。 -The coordinating node merges these shard-level results into its own sorted -priority queue, which represents the globally sorted result set. Here the query -phase ends. +并列节点合并了这些分片段到其排序后的优先队列,这些队列代表着全局排序结果集合,以下是查询过程结束。 [NOTE] ==== -An index can consist of one or more primary shards,((("indices", "multi-index search"))) so a search request -against a single index needs to be able to combine the results from multiple -shards. A search against _multiple_ or _all_ indices works in exactly the same -way--there are just more shards involved. +一个索引可被一个或几个主分片组成, ((("indices", "multi-index search"))) 所以一条搜索请求到单独的索引时需要参考多个分片。除了涉及到更多的分片, _multiple_ 或者 _all_ 索引搜索工作方式是一样的。 ==== From d17be7cf4ec07d32db1bfb69b24812521ea97ee3 Mon Sep 17 00:00:00 2001 From: JessicaWon <476556993@qq.com> Date: Thu, 8 Sep 2016 16:24:38 +0800 Subject: [PATCH 74/95] Revert "060 distributed search/00_Intro.asciidoc and 05_Query_phase.asciidoc" --- 060_Distributed_Search/00_Intro.asciidoc | 19 ------ .../05_Query_phase.asciidoc | 63 ++++++++++++++----- 2 files changed, 46 insertions(+), 36 deletions(-) diff --git a/060_Distributed_Search/00_Intro.asciidoc b/060_Distributed_Search/00_Intro.asciidoc index 244247e87..a6098a6c5 100644 --- a/060_Distributed_Search/00_Intro.asciidoc +++ b/060_Distributed_Search/00_Intro.asciidoc @@ -32,22 +32,3 @@ But finding all matching documents is only half the story. Results from multiple shards must be combined into a single sorted list before the `search` API can return a ``page'' of results. 
For this reason, search is executed in a two-phase process called _query then fetch_. -[[分布式检索]] -== 分布式检索执行 - -在开始之前,我们先来讨论有关在分布式环境中检索是如何进行的。((("distributed search execution")))比我们之前在<>中讨论过的基础的_create-read-update-delete_ (CRUD)请求的((("CRUD (create-read-update-delete) operations")))较为简单。 - -.内容提示 -**** - -你有兴趣的话可以读一读这章,并不需要为了使用Elasticsearch而理解和记住所有的细节。 - -这章的阅读目的只为在脑海中形成服务运行的梗概以及了解信息的存放位置以便不时之需,但是不要被细节搞的云里雾里。 - -**** - -CRUD的操作处理一个单个的文档,此文档中有一个`_index`, `_type`和<>之间的特殊连接,其中<>的缺省值为`_id`。这意味着我们知道在集群中哪个分片存有此文档。 - -检索需要一个更为精细的模型因为我们不知道哪条文档会被命中:这些文档可能分布在集群的任何分片上。一条检索的请求需要参考我们感兴趣的所有索引中的每个分片复本,这样来确认索引中是否有任何匹配的文档。 - -定位所有的匹配文档仅仅是开始,不同分片的结果在`search`的API返回``page''结果前必须融合到一个单个的已分类列表中。正因为如此,检索执行通常两步走,先是_query,然后是fetch_。 diff --git a/060_Distributed_Search/05_Query_phase.asciidoc b/060_Distributed_Search/05_Query_phase.asciidoc index fe63293b1..dde4256bc 100644 --- a/060_Distributed_Search/05_Query_phase.asciidoc +++ b/060_Distributed_Search/05_Query_phase.asciidoc @@ -1,9 +1,16 @@ -=== 搜索阶段 -在最初阶段 _query phase_ 时, ((("distributed search execution", "query phase"))) ((("query phase of distributed search"))) 搜索是广播查询索引中的每一个分片复本,不管是主本还是副本。每个分片执行本地查询,同时 ((("priority queue"))) 创建文档命中后的 _priority queue_ 。 +=== Query Phase -.优先队列 +During the initial _query phase_, the((("distributed search execution", "query phase")))((("query phase of distributed search"))) query is broadcast to a shard copy (a +primary or replica shard) of every shard in the index. Each shard executes +the search locally and ((("priority queue")))builds a _priority queue_ of matching documents. + +.Priority Queue **** -_priority queue_ 仅仅是一个含有命中文档的 _top-n_ 过滤后列表。优先队列的大小取决于分页参数 `from` 和 `size` 。例如,如下搜索请求将需要足够大的优先队列来放入100条文档。 + +A _priority queue_ is just a sorted list that holds the _top-n_ matching +documents. The size of the priority queue depends on the pagination +parameters `from` and `size`. For example, the following search request +would require a priority queue big enough to hold 100 documents: [source,js] -------------------------------------------------- @@ -15,30 +22,52 @@ GET /_search -------------------------------------------------- **** -查询过程在 <> 中有描述。 +The query phase process is depicted in <>. [[img-distrib-search]] -.Query phase of distributed s -.查询过程分布式搜索 -image::images/elas_0901.png["查询过程分布式搜索"] +.Query phase of distributed search +image::images/elas_0901.png["Query phase of distributed search"] -查询过程包含以下几个步骤: +The query phase consists of the following three steps: -1. 客户端发送 `search` 请求到 `Node 3`,会差生一个大小为 `from + size` 的空优先队列。 +1. The client sends a `search` request to `Node 3`, which creates an empty + priority queue of size `from + size`. -2. `Node 3` 将查询请求前转到每个索引的每个分片中的主本或复本去。每个分片执行本地查询并添加结果到大小为 `from + size` 的本地优先队列中。 +2. `Node 3` forwards the search request to a primary or replica copy of every + shard in the index. Each shard executes the query locally and adds the + results into a local sorted priority queue of size `from + size`. -3. 每个分片返回文档的IDs并且将所有优先队列中文档归类到对应的节点, `Node 3` 合并这些值到其优先队列中来产生一个全局排序后的列表。 +3. Each shard returns the doc IDs and sort values of all the docs in its + priority queue to the coordinating node, `Node 3`, which merges these + values into its own priority queue to produce a globally sorted list of + results. 
-当查询请求到达节点的时候,节点变成了并列节点。 ((("nodes", "coordinating node for search requests"))) 这个节点任务是广播查询请求到所有相关节点并收集其他节点的返回状态存入全局排序后的集合,状态最终可以返回到客户端。 +When a search request is sent to a node, that node becomes the coordinating +node.((("nodes", "coordinating node for search requests"))) It is the job of this node to broadcast the search request to all +involved shards, and to gather their responses into a globally sorted result +set that it can return to the client. -第一步是广播请求到索引中的每个几点钟一个分片复本去。就像 <> 查询请求可以被某个主分片或其副本处理, ((("shards", "handling search requests"))) 则是在结合硬件的时候处理多个复本如何增加查询吞吐率。一个并列节点将在之后的请求中轮询所有的分片复本来分散负载。 +The first step is to broadcast the request to a shard copy of every node in +the index. Just like <>, search requests +can be handled by a primary shard or by any of its replicas.((("shards", "handling search requests"))) This is how more +replicas (when combined with more hardware) can increase search throughput. +A coordinating node will round-robin through all shard copies on subsequent +requests in order to spread the load. -每个分片在本地执行查询请求并且创建一个长度为 `from + size`— 的优先队列;换句话说,它自己的查询结果来满足全局查询请求,它返回一个轻量级的结果列表到并列节点上,其中并列节点仅包含文档IDs和排序的任何值,比如 `_score` 。 +Each shard executes the query locally and builds a sorted priority queue of +length `from + size`—in other words, enough results to satisfy the global +search request all by itself. It returns a lightweight list of results to the +coordinating node, which contains just the doc IDs and any values required for +sorting, such as the `_score`. -并列节点合并了这些分片段到其排序后的优先队列,这些队列代表着全局排序结果集合,以下是查询过程结束。 +The coordinating node merges these shard-level results into its own sorted +priority queue, which represents the globally sorted result set. Here the query +phase ends. [NOTE] ==== -一个索引可被一个或几个主分片组成, ((("indices", "multi-index search"))) 所以一条搜索请求到单独的索引时需要参考多个分片。除了涉及到更多的分片, _multiple_ 或者 _all_ 索引搜索工作方式是一样的。 +An index can consist of one or more primary shards,((("indices", "multi-index search"))) so a search request +against a single index needs to be able to combine the results from multiple +shards. A search against _multiple_ or _all_ indices works in exactly the same +way--there are just more shards involved. ==== From b1366a097359aecabaf0a94d0dc7a91770408e03 Mon Sep 17 00:00:00 2001 From: JessicaWon <476556993@qq.com> Date: Thu, 8 Sep 2016 02:05:41 -0700 Subject: [PATCH 75/95] chapter9_part1: /060_Distributed_Search/00_Intro.asciidoc --- 060_Distributed_Search/00_Intro.asciidoc | 32 ++++++------------------ 1 file changed, 8 insertions(+), 24 deletions(-) diff --git a/060_Distributed_Search/00_Intro.asciidoc b/060_Distributed_Search/00_Intro.asciidoc index a6098a6c5..38cfbedc3 100644 --- a/060_Distributed_Search/00_Intro.asciidoc +++ b/060_Distributed_Search/00_Intro.asciidoc @@ -1,34 +1,18 @@ [[distributed-search]] -== Distributed Search Execution +== 分布式检索执行 -Before moving on, we are going to take a detour and talk about how search is -executed in a distributed environment.((("distributed search execution"))) It is a bit more complicated than the -basic _create-read-update-delete_ (CRUD) requests((("CRUD (create-read-update-delete) operations"))) that we discussed in -<>. +在继续之前,我们将讨论一下在分布式环境中搜索是怎么运行的。 ((("distributed search execution"))) 这比基本的 _create-read-update-delete_ (CRUD) 请求,即 <> 章节中的 ((("CRUD (create-read-update-delete) operations"))) 要复杂一些. -.Content Warning +.内容提示 **** -The information presented in this chapter is for your interest. You are not required to -understand and remember all the detail in order to use Elasticsearch. 
- -Read this chapter to gain a taste for how things work, and to know where the -information is in case you need to refer to it in the future, but don't be -overwhelmed by the detail. +你有兴趣的话可以读一读这章,注意的是并不需要为了使用Elasticsearch而理解和记住所有的细节。 +这章的阅读目的只为在脑海中形成服务运行的梗概以及了解信息的存放位置以便不时之需,但是不要被细节搞的云里雾里。 **** -A CRUD operation deals with a single document that has a unique combination of -`_index`, `_type`, and <> (which defaults to the -document's `_id`). This means that we know exactly which shard in the cluster -holds that document. +一条运行CRUD的操作处理一条单个的文档,这些文档中与 `_index` 、 `_type` 和 <> 有着特殊的连接,其中 <> 的默认值为文档中的 `_id` 值。这表示我们确切的知道集群中哪个分片含有此文档。 -Search requires a more complicated execution model because we don't know which -documents will match the query: they could be on any shard in the cluster. A -search request has to consult a copy of every shard in the index or indices -we're interested in to see if they have any matching documents. +搜索需要一种更加复杂的运行模型因为我们不知道查询会命中哪条文档,这些文档有可能在集群的任何分片上。一条查询请求务必询问我们在意的所有索引的所有分片来确保是否有任何命中的文档。 -But finding all matching documents is only half the story. Results from -multiple shards must be combined into a single sorted list before the `search` -API can return a ``page'' of results. For this reason, search is executed in a -two-phase process called _query then fetch_. +但是发现所有的命中文档仅仅是开始。多分片中的结果必须在 `search` 接口返回一个 ``page'' 结果前和单个的筛选后的列表联系起来,为此,查询执行分为两步,先是 _query ,后是 fetch_ 。 From 76eee1a54f932106ec3ec3ebbc0f8370b02daf8c Mon Sep 17 00:00:00 2001 From: JessicaWon <476556993@qq.com> Date: Mon, 12 Sep 2016 10:36:40 +0800 Subject: [PATCH 76/95] Update 00_Intro.asciidoc --- 060_Distributed_Search/00_Intro.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/060_Distributed_Search/00_Intro.asciidoc b/060_Distributed_Search/00_Intro.asciidoc index 38cfbedc3..af53af8cc 100644 --- a/060_Distributed_Search/00_Intro.asciidoc +++ b/060_Distributed_Search/00_Intro.asciidoc @@ -1,7 +1,7 @@ [[distributed-search]] == 分布式检索执行 -在继续之前,我们将讨论一下在分布式环境中搜索是怎么运行的。 ((("distributed search execution"))) 这比基本的 _create-read-update-delete_ (CRUD) 请求,即 <> 章节中的 ((("CRUD (create-read-update-delete) operations"))) 要复杂一些. 
+在继续之前,我们将讨论一下在分布式环境中搜索是怎么运行的。 ((("distributed search execution"))) 这比基本的 _create-read-update-delete_ (CRUD) 请求,即 <> 章节中的 ((("CRUD (create-read-update-delete) operations"))) 要复杂一些。 .内容提示 **** From 7d1860cb410946e5c301b77c82edf4bf395fe2d9 Mon Sep 17 00:00:00 2001 From: "feng.wei" Date: Mon, 12 Sep 2016 16:44:22 +0800 Subject: [PATCH 77/95] commit 050_search part2 --- 050_Search/10_Multi_index_multi_type.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/050_Search/10_Multi_index_multi_type.asciidoc b/050_Search/10_Multi_index_multi_type.asciidoc index 8024251fd..0b7a0f378 100644 --- a/050_Search/10_Multi_index_multi_type.asciidoc +++ b/050_Search/10_Multi_index_multi_type.asciidoc @@ -39,4 +39,4 @@ ================================================ -最后,你将明白这种简单的方式如何弹性的把请求的变化变得简单化。 +最后,你将明白这种简单的方式如何弹性的把请求的变化变得简单化。 From 430d82667423a51c33119e2bff7c72521fd4103f Mon Sep 17 00:00:00 2001 From: JessicaWon <476556993@qq.com> Date: Sun, 16 Oct 2016 20:54:05 -0700 Subject: [PATCH 78/95] modified: 00_Intro.asciidoc --- 060_Distributed_Search/00_Intro.asciidoc | 35 ++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/060_Distributed_Search/00_Intro.asciidoc b/060_Distributed_Search/00_Intro.asciidoc index af53af8cc..5b85a9091 100644 --- a/060_Distributed_Search/00_Intro.asciidoc +++ b/060_Distributed_Search/00_Intro.asciidoc @@ -1,3 +1,38 @@ +[[distributed-search]] +== Distributed Search Execution + +Before moving on, we are going to take a detour and talk about how search is +executed in a distributed environment.((("distributed search execution"))) It is a bit more complicated than the +basic _create-read-update-delete_ (CRUD) requests((("CRUD (create-read-update-delete) operations"))) that we discussed in +<>. + +.Content Warning +**** + +The information presented in this chapter is for your interest. You are not required to +understand and remember all the detail in order to use Elasticsearch. + +Read this chapter to gain a taste for how things work, and to know where the +information is in case you need to refer to it in the future, but don't be +overwhelmed by the detail. + +**** + +A CRUD operation deals with a single document that has a unique combination of +`_index`, `_type`, and <> (which defaults to the +document's `_id`). This means that we know exactly which shard in the cluster +holds that document. + +Search requires a more complicated execution model because we don't know which +documents will match the query: they could be on any shard in the cluster. A +search request has to consult a copy of every shard in the index or indices +we're interested in to see if they have any matching documents. + +But finding all matching documents is only half the story. Results from +multiple shards must be combined into a single sorted list before the `search` +API can return a ``page'' of results. For this reason, search is executed in a +two-phase process called _query then fetch_. 
+ [[distributed-search]] == 分布式检索执行 From d688a066780ce5beb3d4594a4511959d158b2b6e Mon Sep 17 00:00:00 2001 From: luotitan Date: Mon, 17 Oct 2016 15:22:15 +0800 Subject: [PATCH 79/95] Revert "chapter9_part1: /060_Distributed_Search/00_Intro.asciidoc" --- 060_Distributed_Search/00_Intro.asciidoc | 19 ------------------- 1 file changed, 19 deletions(-) diff --git a/060_Distributed_Search/00_Intro.asciidoc b/060_Distributed_Search/00_Intro.asciidoc index 5b85a9091..a6098a6c5 100644 --- a/060_Distributed_Search/00_Intro.asciidoc +++ b/060_Distributed_Search/00_Intro.asciidoc @@ -32,22 +32,3 @@ But finding all matching documents is only half the story. Results from multiple shards must be combined into a single sorted list before the `search` API can return a ``page'' of results. For this reason, search is executed in a two-phase process called _query then fetch_. - -[[distributed-search]] -== 分布式检索执行 - -在继续之前,我们将讨论一下在分布式环境中搜索是怎么运行的。 ((("distributed search execution"))) 这比基本的 _create-read-update-delete_ (CRUD) 请求,即 <> 章节中的 ((("CRUD (create-read-update-delete) operations"))) 要复杂一些。 - -.内容提示 -**** - -你有兴趣的话可以读一读这章,注意的是并不需要为了使用Elasticsearch而理解和记住所有的细节。 -这章的阅读目的只为在脑海中形成服务运行的梗概以及了解信息的存放位置以便不时之需,但是不要被细节搞的云里雾里。 - -**** - -一条运行CRUD的操作处理一条单个的文档,这些文档中与 `_index` 、 `_type` 和 <> 有着特殊的连接,其中 <> 的默认值为文档中的 `_id` 值。这表示我们确切的知道集群中哪个分片含有此文档。 - -搜索需要一种更加复杂的运行模型因为我们不知道查询会命中哪条文档,这些文档有可能在集群的任何分片上。一条查询请求务必询问我们在意的所有索引的所有分片来确保是否有任何命中的文档。 - -但是发现所有的命中文档仅仅是开始。多分片中的结果必须在 `search` 接口返回一个 ``page'' 结果前和单个的筛选后的列表联系起来,为此,查询执行分为两步,先是 _query ,后是 fetch_ 。 From dab560fde7c7974a0b03f51d3db813e95fbfd67d Mon Sep 17 00:00:00 2001 From: yichao2015 <675340089@qq.com> Date: Mon, 17 Oct 2016 04:20:56 -0500 Subject: [PATCH 80/95] chapter10_part2:/070_Index_Mgmt/10_Settings.asciidoc (#297) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * chapter10_part2:/070_Index_Mgmt/10_Settings.asciidoc * 二次review后修改 --- 070_Index_Mgmt/10_Settings.asciidoc | 27 ++++++++++----------------- 1 file changed, 10 insertions(+), 17 deletions(-) diff --git a/070_Index_Mgmt/10_Settings.asciidoc b/070_Index_Mgmt/10_Settings.asciidoc index ac7373fa6..88da77ec6 100644 --- a/070_Index_Mgmt/10_Settings.asciidoc +++ b/070_Index_Mgmt/10_Settings.asciidoc @@ -1,28 +1,21 @@ -=== Index Settings +=== 索引设置 -There are many many knobs((("index settings"))) that you can twiddle to -customize index behavior, which you can read about in the -{ref}/index-modules.html[Index Modules reference documentation], -but... +你可以通过修改配置来((("index settings")))自定义索引行为,详细配置参照 +{ref}/index-modules.html[索引模块] -TIP: Elasticsearch comes with good defaults. Don't twiddle these knobs until -you understand what they do and why you should change them. +TIP: Elasticsearch 提供了优化好的默认配置。 除非你理解这些配置的作用并且知道为什么要去修改,否则不要随意修改。 -Two of the most important((("shards", "number_of_shards index setting")))((("number_of_shards setting")))((("index settings", "number_of_shards"))) settings are as follows: +下面是两个((("shards", "number_of_shards index setting")))((("number_of_shards setting")))((("index settings", "number_of_shards"))) 最重要的设置: `number_of_shards`:: - The number of primary shards that an index should have, - which defaults to `5`. This setting cannot be changed - after index creation. + 每个索引的主分片数,默认值是 `5` 。这个配置在索引创建后不能修改。 `number_of_replicas`:: - The number of replica shards (copies) that each primary shard - should have, which defaults to `1`. This setting can be changed - at any time on a live index. 
+ 每个主分片的副本数,默认值是 `1` 。对于活动的索引库,这个配置可以随时修改。 -For instance, we could create a small index--just((("index settings", "number_of_replicas")))((("replica shards", "number_of_replicas index setting"))) one primary shard--and no replica shards with the following request: +例如,我们可以创建只有((("index settings", "number_of_replicas")))((("replica shards", "number_of_replicas index setting"))) 一个主分片,没有副本的小索引: [source,js] -------------------------------------------------- @@ -36,8 +29,8 @@ PUT /my_temp_index -------------------------------------------------- // SENSE: 070_Index_Mgmt/10_Settings.json -Later, we can change the number of replica shards dynamically using the -`update-index-settings` API as((("update-index-settings API"))) follows: +然后,我们可以用 +`update-index-settings` API ((("update-index-settings API"))) 动态修改副本数: [source,js] -------------------------------------------------- From a68b540749055a72c1e07b79638fdad13c8d46ec Mon Sep 17 00:00:00 2001 From: yichao2015 <675340089@qq.com> Date: Mon, 17 Oct 2016 04:29:40 -0500 Subject: [PATCH 81/95] chapter10_part8:/070_Index_Mgmt/32_Metadata_all.asciidoc (#300) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * chapter10_part8:/070_Index_Mgmt/32_Metadata_all.asciidoc * review后修改 * 二次review后修改 --- 070_Index_Mgmt/32_Metadata_all.asciidoc | 46 +++++-------------------- 1 file changed, 9 insertions(+), 37 deletions(-) diff --git a/070_Index_Mgmt/32_Metadata_all.asciidoc b/070_Index_Mgmt/32_Metadata_all.asciidoc index d4b61c819..82dfe1677 100644 --- a/070_Index_Mgmt/32_Metadata_all.asciidoc +++ b/070_Index_Mgmt/32_Metadata_all.asciidoc @@ -1,15 +1,9 @@ [[all-field]] -==== Metadata: _all Field +==== 元数据: _all 字段 -In <>, we introduced the `_all` field: a special field that -indexes the ((("metadata, document", "_all field")))((("_all field", sortas="all field")))values from all other fields as one big string. The `query_string` -query clause (and searches performed as `?q=john`) defaults to searching in -the `_all` field if no other field is specified. +在 <> 中,我们介绍了 `_all` 字段:一个把其它字段值((("metadata, document", "_all field")))((("_all field", sortas="all field")))当作一个大字符串来索引的特殊字段。 `query_string` 查询子句(搜索 `?q=john` )在没有指定字段时默认使用 `_all` 字段。 -The `_all` field is useful during the exploratory phase of a new application, -while you are still unsure about the final structure that your documents will -have. You can throw any query string at it and you have a good chance of -finding the document you're after: +`_all` 字段在新应用的探索阶段,当你还不清楚文档的最终结构时是比较有用的。你可以使用这个字段来做任何查询,并且有很大可能找到需要的文档: [source,js] -------------------------------------------------- @@ -22,24 +16,14 @@ GET /_search -------------------------------------------------- -As your application evolves and your search requirements become more exacting, -you will find yourself using the `_all` field less and less. The `_all` field -is a shotgun approach to search. By querying individual fields, you have more -flexbility, power, and fine-grained control over which results are considered -to be most relevant. +随着应用的发展,搜索需求变得更加明确,你会发现自己越来越少使用 `_all` 字段。 `_all` 字段是搜索的应急之策。通过查询指定字段,你的查询更加灵活、强大,你也可以对相关性最高的搜索结果进行更细粒度的控制。 [NOTE] ==== -One of the important factors taken into account by the -<> -is the length of the field: the shorter the field, the more important. A term -that appears in a short `title` field is likely to be more important than the -same term that appears somewhere in a long `content` field. This distinction -between field lengths disappears in the `_all` field. 
+<> 考虑的一个最重要的原则是字段的长度:字段越短越重要。 在较短的 `title` 字段中出现的短语可能比在较长的 `content` 字段中出现的短语更加重要。字段长度的区别在 `_all` 字段中不会出现。 ==== -If you decide that you no longer need the `_all` field, you can disable it -with this mapping: +如果你不再需要 `_all` 字段,你可以通过下面的映射来禁用: [source,js] -------------------------------------------------- @@ -51,17 +35,9 @@ PUT /my_index/_mapping/my_type } -------------------------------------------------- +通过 `include_in_all` 设置来逐个控制字段是否要包含在 `_all` 字段中,((("include_in_all setting")))默认值是 `true`。在一个对象(或根对象)上设置 `include_in_all` 可以修改这个对象中的所有字段的默认行为。 -Inclusion in the `_all` field can be controlled on a field-by-field basis -by using the `include_in_all` setting, ((("include_in_all setting")))which defaults to `true`. Setting -`include_in_all` on an object (or on the root object) changes the -default for all fields within that object. - -You may find that you want to keep the `_all` field around to use -as a catchall full-text field just for specific fields, such as -`title`, `overview`, `summary`, and `tags`. Instead of disabling the `_all` -field completely, disable `include_in_all` for all fields by default, -and enable it only on the fields you choose: +你可能想要保留 `_all` 字段作为一个只包含某些特定字段的全文字段,例如只包含 `title`,`overview`,`summary` 和 `tags`。 相对于完全禁用 `_all` 字段,你可以为所有字段默认禁用 `include_in_all` 选项,仅在你选择的字段上启用: [source,js] -------------------------------------------------- @@ -81,11 +57,7 @@ PUT /my_index/my_type/_mapping -------------------------------------------------- -Remember that the `_all` field is just((("analyzers", "configuring for all field"))) an analyzed `string` field. It -uses the default analyzer to analyze its values, regardless of which -analyzer has been set on the fields where the values originate. And -like any `string` field, you can configure which analyzer the `_all` -field should use: +记住,`_all` 字段仅仅是一个((("analyzers", "configuring for all field"))) 经过分词的 `string` 字段。它使用默认分词器来分析它的值,不管这个值原本所在字段指定的分词器。就像所有 `string` 字段,你可以配置 `_all` 字段使用的分词器: [source,js] -------------------------------------------------- From 0b46e8a72be4635fae5bf2080c64aada064cd825 Mon Sep 17 00:00:00 2001 From: cdma Date: Fri, 21 Oct 2016 10:37:19 +0800 Subject: [PATCH 82/95] =?UTF-8?q?chapter10=5Fpart13=EF=BC=9A070=5FIndex=5F?= =?UTF-8?q?Mgmt/50=5FReindexing.asciidoc=20(#309)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * translated * translate overs * change review result and change branch name --- 070_Index_Mgmt/50_Reindexing.asciidoc | 39 ++++++++++----------------- 1 file changed, 14 insertions(+), 25 deletions(-) diff --git a/070_Index_Mgmt/50_Reindexing.asciidoc b/070_Index_Mgmt/50_Reindexing.asciidoc index a0d54ed14..de15cd59f 100644 --- a/070_Index_Mgmt/50_Reindexing.asciidoc +++ b/070_Index_Mgmt/50_Reindexing.asciidoc @@ -1,32 +1,24 @@ [[reindex]] -=== Reindexing Your Data +=== 重新索引你的数据 -Although you can add new types to an index, or add new fields to a type, you -can't add new analyzers or make changes to existing fields.((("reindexing")))((("indexing", "reindexing your data"))) If you were to do -so, the data that had already been indexed would be incorrect and your -searches would no longer work as expected. +尽管可以增加新的类型到索引中,或者增加新的字段到类型中,但是不能添加新的分析器或者对现有的字段做改动。 + ((("reindexing")))((("indexing", "reindexing your data"))) 如果你那么做的话,结果就是那些已经被索引的数据就不正确, +搜索也不能正常工作。 -The simplest way to apply these changes to your existing data is to -reindex: create a new index with the new settings and copy all of your -documents from the old index to the new index. 
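If you are on Elasticsearch 2.3.0 or later, the Reindex API mentioned a little further on can perform this copy in a single call. A minimal sketch, where `old_index` and `new_index` are placeholder names, and `new_index` must already exist with the new settings and mappings, since `_reindex` copies documents only:

[source,js]
--------------------------------------------------
POST /_reindex
{
    "source": { "index": "old_index" },
    "dest":   { "index": "new_index" }
}
--------------------------------------------------

On older versions, the scroll-and-bulk approach described next is the way to go.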
+对现有数据的这类改变最简单的办法就是重新索引:用新的设置创建新的索引并把文档从旧的索引复制到新的索引。 -One of the advantages of the `_source` field is that you already have the -whole document available to you in Elasticsearch itself. You don't have to -rebuild your index from the database, which is usually much slower. +字段 `_source` 的一个优点是在Elasticsearch中已经有整个文档。你不必从源数据中重建索引,而且那样通常比较慢。 -To reindex all of the documents from the old index efficiently, use -<> to retrieve batches((("using in reindexing documents"))) of documents from the old index, -and the <> to push them into the new index. +为了有效的重新索引所有在旧的索引中的文档,用 <> 从旧的索引检索批量文档 ((("using in reindexing documents"))) , +然后用 <> 把文档推送到新的索引中。 -Beginning with Elasticsearch v2.3.0, a {ref}/docs-reindex.html[Reindex API] has been introduced. It enables you -to reindex your documents without requiring any plugin nor external tool. +从Elasticsearch v2.3.0开始, {ref}/docs-reindex.html[Reindex API] 被引入。它能够对文档重建索引而不需要任何插件或外部工具。 -.Reindexing in Batches +.批量重新索引 **** -You can run multiple reindexing jobs at the same time, but you obviously don't -want their results to overlap. Instead, break a big reindex down into smaller -jobs by filtering on a date or timestamp field: +同时并行运行多个重建索引任务,但是你显然不希望结果有重叠。正确的做法是按日期或者时间 +这样的字段作为过滤条件把大的重建索引分成小的任务: [source,js] -------------------------------------------------- @@ -46,11 +38,8 @@ GET /old_index/_search?scroll=1m -------------------------------------------------- -If you continue making changes to the old index, you will want to make -sure that you include the newly added documents in your new index as well. -This can be done by rerunning the reindex process, but again filtering -on a date field to match only documents that have been added since the -last reindex process started. +如果旧的索引持续会有变化,你希望新的索引中也包括那些新加的文档。那就可以对新加的文档做重新索引, +但还是要用日期类字段过滤来匹配那些新加的文档。 **** From 5aa3bbe49a0a8adbd34639d51e550dc71d882ee1 Mon Sep 17 00:00:00 2001 From: yichao2015 <675340089@qq.com> Date: Thu, 20 Oct 2016 21:43:07 -0500 Subject: [PATCH 83/95] chapter10_part12:/070_Index_Mgmt/45_Default_Mapping.asciidoc (#299) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * chapter10_part12:/070_Index_Mgmt/45_Default_Mapping.asciidoc * 二次review后修改 * 上次没修改完 --- 070_Index_Mgmt/45_Default_Mapping.asciidoc | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff --git a/070_Index_Mgmt/45_Default_Mapping.asciidoc b/070_Index_Mgmt/45_Default_Mapping.asciidoc index a122e7d6b..7ae86bd18 100644 --- a/070_Index_Mgmt/45_Default_Mapping.asciidoc +++ b/070_Index_Mgmt/45_Default_Mapping.asciidoc @@ -1,15 +1,9 @@ [[default-mapping]] -=== Default Mapping +=== 缺省映射 -Often, all types in an index share similar fields and settings. ((("mapping (types)", "default")))((("default mapping"))) It can be -more convenient to specify these common settings in the `_default_` mapping, -instead of having to repeat yourself every time you create a new type. The -`_default_` mapping acts as a template for new types. All types created -_after_ the `_default_` mapping will include all of these default settings, -unless explicitly overridden in the type mapping itself. 
+通常,一个索引中的所有类型共享相同的字段和设置。 ((("mapping (types)", "default")))((("default mapping"))) `_default_` 映射更加方便地指定通用设置,而不是每次创建新类型时都要重复设置。 `_default_` 映射是新类型的模板。在设置 `_default_` 映射之后创建的所有类型都将应用这些缺省的设置,除非类型在自己的映射中明确覆盖这些设置。 -For instance, we can disable the `_all` field for all types,((("_all field", sortas="all field"))) using the -`_default_` mapping, but enable it just for the `blog` type, as follows: +例如,我们可以使用 `_default_` 映射为所有的类型禁用 `_all` 字段,((("_all field", sortas="all field"))) 而只在 `blog` 类型启用: [source,js] -------------------------------------------------- @@ -28,5 +22,4 @@ PUT /my_index // SENSE: 070_Index_Mgmt/45_Default_mapping.json -The `_default_` mapping can also be a good place to specify index-wide -<>. +`_default_` 映射也是一个指定索引 <> 的好方法。 From 1f8d21b36e06fd90955290c8d38dbd61ea48846e Mon Sep 17 00:00:00 2001 From: luotitan Date: Fri, 21 Oct 2016 10:47:22 +0800 Subject: [PATCH 84/95] chapter10_part9: /070_Index_Mgmt/33_Metadata_ID.asciidoc (#196) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 元数据:文档标识-第一次提交 * 根据review意见修改 --- 070_Index_Mgmt/33_Metadata_ID.asciidoc | 25 +++++++++++-------------- 1 file changed, 11 insertions(+), 14 deletions(-) diff --git a/070_Index_Mgmt/33_Metadata_ID.asciidoc b/070_Index_Mgmt/33_Metadata_ID.asciidoc index 094d1ccda..12146b0be 100644 --- a/070_Index_Mgmt/33_Metadata_ID.asciidoc +++ b/070_Index_Mgmt/33_Metadata_ID.asciidoc @@ -1,25 +1,22 @@ -==== Metadata: Document Identity +==== 元数据:文档标识 -There are four metadata fields ((("metadata, document", "identity")))associated with document identity: +文档标识与四个元数据字段((("metadata, document", "identity")))相关: `_id`:: - The string ID of the document + 文档的 ID 字符串 `_type`:: - The type name of the document + 文档的类型名 `_index`:: - The index where the document lives + 文档所在的索引 `_uid`:: - The `_type` and `_id` concatenated together as `type#id` + `_type` 和 `_id` 连接在一起构造成 `type#id` -By default, the `_uid` field is((("id field"))) stored (can be retrieved) and -indexed (searchable). The `_type` field((("type field")))((("index field")))((("uid field"))) is indexed but not stored, -and the `_id` and `_index` fields are neither indexed nor stored, meaning -they don't really exist. +默认情况下, `_uid` 字段是被((("id field")))存储(可取回)和索引(可搜索)的。 +`_type` 字段((("type field")))((("index field")))((("uid field")))被索引但是没有存储, +`_id` 和 `_index` 字段则既没有被索引也没有被存储,这意味着它们并不是真实存在的。 -In spite of this, you can query the `_id` field as though it were a real -field. Elasticsearch uses the `_uid` field to derive the `_id`. Although you -can change the `index` and `store` settings for these fields, you almost -never need to do so. 
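For example, a simple term query against `_id` behaves as expected even though `_id` itself is not indexed; Elasticsearch answers it from `_uid` behind the scenes. A quick sketch, with placeholder index, type, and ID values:

[source,js]
--------------------------------------------------
GET /website/blog/_search
{
    "query": {
        "term": { "_id": "123" }
    }
}
--------------------------------------------------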
+尽管如此,你仍然可以像真实字段一样查询 `_id` 字段。Elasticsearch 使用 `_uid` 字段来派生出 `_id` 。 +虽然你可以修改这些字段的 `index` 和 `store` 设置,但是基本上不需要这么做。 From a6dc87709713eaed97ef9eb316eb630cb38a1de1 Mon Sep 17 00:00:00 2001 From: luotitan Date: Fri, 21 Oct 2016 11:10:09 +0800 Subject: [PATCH 85/95] chapter9_part1: /060_Distributed_Search/00_Intro.asciidoc (#321) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 卷发同学完成翻译,修改以后提交 * 根据review意见修改 --- 060_Distributed_Search/00_Intro.asciidoc | 41 +++++++++++------------- 1 file changed, 18 insertions(+), 23 deletions(-) diff --git a/060_Distributed_Search/00_Intro.asciidoc b/060_Distributed_Search/00_Intro.asciidoc index a6098a6c5..93b6303c1 100644 --- a/060_Distributed_Search/00_Intro.asciidoc +++ b/060_Distributed_Search/00_Intro.asciidoc @@ -1,34 +1,29 @@ [[distributed-search]] -== Distributed Search Execution +== 执行分布式检索 -Before moving on, we are going to take a detour and talk about how search is -executed in a distributed environment.((("distributed search execution"))) It is a bit more complicated than the -basic _create-read-update-delete_ (CRUD) requests((("CRUD (create-read-update-delete) operations"))) that we discussed in -<>. +在继续之前,我们将绕道讨论一下在分布式环境中搜索是怎么执行的。 +((("distributed search execution"))) 这比我们在 <> 章节讨论的基本的 _增-删-改-查_ (CRUD)((("CRUD (create-read-update-delete) operations")))请求要复杂一些。 -.Content Warning + +.内容提示 **** -The information presented in this chapter is for your interest. You are not required to -understand and remember all the detail in order to use Elasticsearch. +你可以根据兴趣阅读本章内容。你并不需要为了使用 Elasticsearch 而理解和记住所有的细节。 -Read this chapter to gain a taste for how things work, and to know where the -information is in case you need to refer to it in the future, but don't be -overwhelmed by the detail. +这章的阅读目的只为初步了解下工作原理,以便将来需要时可以及时找到这些知识, +但是不要被细节所困扰。 **** -A CRUD operation deals with a single document that has a unique combination of -`_index`, `_type`, and <> (which defaults to the -document's `_id`). This means that we know exactly which shard in the cluster -holds that document. +一个 CRUD 操作只对单个文档进行处理,文档的唯一性由 `_index`, `_type`, +和 <> (通常默认是该文档的 `_id` )的组合来确定。 +这表示我们确切的知道集群中哪个分片含有此文档。 + + +搜索需要一种更加复杂的执行模型因为我们不知道查询会命中哪些文档: 这些文档有可能在集群的任何分片上。 +一个搜索请求必须询问我们关注的索引(index or indices)的所有分片的某个副本来确定它们是否含有任何匹配的文档。 -Search requires a more complicated execution model because we don't know which -documents will match the query: they could be on any shard in the cluster. A -search request has to consult a copy of every shard in the index or indices -we're interested in to see if they have any matching documents. -But finding all matching documents is only half the story. Results from -multiple shards must be combined into a single sorted list before the `search` -API can return a ``page'' of results. For this reason, search is executed in a -two-phase process called _query then fetch_. 
+但是找到所有的匹配文档仅仅完成事情的一半。 +在 `search` 接口返回一个 ``page`` 结果之前,多分片中的结果必须组合成单个排序列表。 +为此,搜索被执行成一个两阶段过程,我们称之为 _query then fetch_ 。 From c6541495b7bd500191837727e87c23b9f24db0e3 Mon Sep 17 00:00:00 2001 From: weiqiangyuan Date: Sat, 22 Oct 2016 14:25:44 +0800 Subject: [PATCH 86/95] chapter43_part2: /404_Parent_Child/45_Indexing_parent_child.asciidoc (#273) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 完成Parent-Child第二章的翻译 * revise acording pr first review * revise pc * add desc for single doc * revise type --- .../45_Indexing_parent_child.asciidoc | 38 +++++-------------- 1 file changed, 10 insertions(+), 28 deletions(-) diff --git a/404_Parent_Child/45_Indexing_parent_child.asciidoc b/404_Parent_Child/45_Indexing_parent_child.asciidoc index 26fa7210a..9d6b9e15f 100644 --- a/404_Parent_Child/45_Indexing_parent_child.asciidoc +++ b/404_Parent_Child/45_Indexing_parent_child.asciidoc @@ -1,8 +1,7 @@ [[indexing-parent-child]] -=== Indexing Parents and Children +=== 构建父-子文档索引 -Indexing parent documents is no different from any other document. Parents -don't need to know anything about their children: +为父文档创建索引与为普通文档创建索引没有区别。父文档并不需要知道它有哪些子文档。 [source,json] ------------------------- @@ -15,8 +14,7 @@ POST /company/branch/_bulk { "name": "Champs Élysées", "city": "Paris", "country": "France" } ------------------------- -When indexing child documents, you must specify the ID of the associated -parent document: +创建子文档时,用户必须要通过 `parent` 参数来指定该子文档的父文档 ID: [source,json] ------------------------- @@ -27,31 +25,19 @@ PUT /company/employee/1?parent=london <1> "hobby": "hiking" } ------------------------- -<1> This `employee` document is a child of the `london` branch. +<1> 当前 `employee` 文档的父文档 ID 是 `london` 。 -This `parent` ID serves two purposes: it creates the link between the parent -and the child, and it ensures that the child document is stored on the same -shard as the parent. +父文档 ID 有两个作用:创建了父文档和子文档之间的关系,并且保证了父文档和子文档都在同一个分片上。 -In <>, we explained how Elasticsearch uses a routing value, -which defaults to the `_id` of the document, to decide which shard a document -should belong to. The routing value is plugged into this simple formula: +在 <> 中,我们解释了 Elasticsearch 如何通过路由值来决定该文档属于哪一个分片,路由值默认为该文档的 `_id` 。分片路由的计算公式如下: shard = hash(routing) % number_of_primary_shards -However, if a `parent` ID is specified, it is used as the routing value -instead of the `_id`. In other words, both the parent and the child use the -same routing value--the `_id` of the parent--and so they are both stored -on the same shard. +如果指定了父文档的 ID,那么就会使用父文档的 ID 进行路由,而不会使用当前文档 `_id` 。也就是说,如果父文档和子文档都使用相同的值进行路由,那么父文档和子文档都会确定分布在同一个分片上。 -The `parent` ID needs to be specified on all single-document requests: -when retrieving a child document with a `GET` request, or when indexing, -updating, or deleting a child document. Unlike a search request, which is -forwarded to all shards in an index, these single-document requests are -forwarded only to the shard that holds the document--if the `parent` ID is -not specified, the request will probably be forwarded to the wrong shard. 
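For example, retrieving the employee indexed above has to carry the same `parent` value that was used at index time:

[source,js]
--------------------------------------------------
GET /company/employee/1?parent=london
--------------------------------------------------

Without `?parent=london`, the request would be routed by `_id` alone and would most likely be sent to a shard that does not hold the document.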
+在执行单文档的请求时需要指定父文档的 ID,单文档请求包括:通过 `GET` 请求获取一个子文档;创建、更新或删除一个子文档。而执行搜索请求时是不需要指定父文档的ID,这是因为搜索请求是向一个索引中的所有分片发起请求,而单文档的操作是只会向存储该文档的分片发送请求。因此,如果操作单个子文档时不指定父文档的 ID,那么很有可能会把请求发送到错误的分片上。 -The `parent` ID should also be specified when using the `bulk` API: +父文档的 ID 应该在 `bulk` API 中指定 [source,json] ------------------------- @@ -64,8 +50,4 @@ POST /company/employee/_bulk { "name": "Adrien Grand", "dob": "1987-05-11", "hobby": "horses" } ------------------------- -WARNING: If you want to change the `parent` value of a child document, it is -not sufficient to just reindex or update the child document--the new parent -document may be on a different shard. Instead, you must first delete the old -child, and then index the new child. - +WARNING: 如果你想要改变一个子文档的 `parent` 值,仅通过更新这个子文档是不够的,因为新的父文档有可能在另外一个分片上。因此,你必须要先把子文档删除,然后再重新索引这个子文档。 From 5f8c44f5f34c4d80ae4eb3b9d331e4f6bda70e3e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E9=A5=B6=E7=90=9B=E7=90=B3?= Date: Sat, 22 Oct 2016 14:28:22 +0800 Subject: [PATCH 87/95] chapter47_part3:/520_Post_Deployment/30_indexing_perf.asciidoc (#56) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * chapter47_part3:/520_Post_Deployment/30_indexing_perf.asciidoc * 按照review意见修改 --- 520_Post_Deployment/30_indexing_perf.asciidoc | 182 +++++------------- 1 file changed, 53 insertions(+), 129 deletions(-) diff --git a/520_Post_Deployment/30_indexing_perf.asciidoc b/520_Post_Deployment/30_indexing_perf.asciidoc index 7a15fb459..8cbdf8d5e 100644 --- a/520_Post_Deployment/30_indexing_perf.asciidoc +++ b/520_Post_Deployment/30_indexing_perf.asciidoc @@ -1,111 +1,64 @@ [[indexing-performance]] -=== Indexing Performance Tips +=== 索引性能技巧 -If you are in an indexing-heavy environment,((("indexing", "performance tips")))((("post-deployment", "indexing performance tips"))) such as indexing infrastructure -logs, you may be willing to sacrifice some search performance for faster indexing -rates. In these scenarios, searches tend to be relatively rare and performed -by people internal to your organization. They are willing to wait several -seconds for a search, as opposed to a consumer facing a search that must -return in milliseconds. +如果你是在一个索引负载很重的环境,((("indexing", "performance tips")))((("post-deployment", "indexing performance tips")))比如索引的是基础设施日志,你可能愿意牺牲一些搜索性能换取更快的索引速率。在这些场景里,搜索常常是很少见的操作,而且一般是由你公司内部的人发起的。他们也愿意为一个搜索等上几秒钟,而不像普通消费者,要求一个搜索必须毫秒级返回。 -Because of this unique position, certain trade-offs can be made -that will increase your indexing performance. +基于这种特殊的场景,我们可以有几种权衡办法来提高你的索引性能。 -.These Tips Apply Only to Elasticsearch 1.3+ +.这些技巧仅适用于 Elasticsearch 1.3 及以上版本 **** -This book is written for the most recent versions of Elasticsearch, although much -of the content works on older versions. +本书是为最新几个版本的 Elasticsearch 写的,虽然大多数内容在更老的版本也也有效。 -The tips presented in this section, however, are _explicitly_ for version 1.3+. There -have been multiple performance improvements and bugs fixed that directly impact -indexing. In fact, some of these recommendations will _reduce_ performance on -older versions because of the presence of bugs or performance defects. 
+不过,本节提及的技巧, _只_ 针对 1.3 及以上版本。该版本后有不少性能提升和故障修复是直接影响到索引的。事实上,有些建议在老版本上反而会因为故障或性能缺陷而 _降低_ 性能。 **** -==== Test Performance Scientifically +==== 科学的测试性能 -Performance testing is always difficult, so try to be as scientific as possible -in your approach.((("performance testing")))((("indexing", "performance tips", "performance testing"))) Randomly fiddling with knobs and turning on ingestion is not -a good way to tune performance. If there are too many _causes_, it is impossible -to determine which one had the best _effect_. A reasonable approach to testing is as follows: +性能测试永远是复杂的,所以在你的方法里已经要尽可能的科学。((("performance testing")))((("indexing", "performance tips", "performance testing")))随机摆弄旋钮以及写入开关可不是做性能调优的好办法。如果有太多种 _可能_ ,我们就无法判断到底哪一种有最好的 _效果_ 。合理的测试方法如下: -1. Test performance on a single node, with a single shard and no replicas. -2. Record performance under 100% default settings so that you have a baseline to -measure against. -3. Make sure performance tests run for a long time (30+ minutes) so you can -evaluate long-term performance, not short-term spikes or latencies. Some events -(such as segment merging, and GCs) won't happen right away, so the performance -profile can change over time. -4. Begin making single changes to the baseline defaults. Test these rigorously, -and if performance improvement is acceptable, keep the setting and move on to the -next one. +1. 在单个节点上,对单个分片,无副本的场景测试性能。 +2. 在 100% 默认配置的情况下记录性能结果,这样你就有了一个对比基线。 +3. 确保性能测试运行足够长的时间(30 分钟以上)这样你可以评估长期性能,而不是短期的峰值或延迟。一些事件(比如段合并,GC)不会立刻发生,所以性能概况会随着时间继续而改变的。 +4. 开始在基线上逐一修改默认值。严格测试它们,如果性能提升可以接受,保留这个配置项,开始下一项。 -==== Using and Sizing Bulk Requests +==== 使用批量请求并调整其大小 -This should be fairly obvious, but use bulk indexing requests for optimal performance.((("indexing", "performance tips", "bulk requests, using and sizing")))((("bulk API", "using and sizing bulk requests"))) -Bulk sizing is dependent on your data, analysis, and cluster configuration, but -a good starting point is 5–15 MB per bulk. Note that this is physical size. -Document count is not a good metric for bulk size. For example, if you are -indexing 1,000 documents per bulk, keep the following in mind: +显而易见的,优化性能应该使用批量请求。((("indexing", "performance tips", "bulk requests, using and sizing")))((("bulk API", "using and sizing bulk requests")))批量的大小则取决于你的数据、分析和集群配置,不过每次批量数据 5–15 MB 大是个不错的起始点。注意这里说的是物理字节数大小。文档计数对批量大小来说不是一个好指标。比如说,如果你每次批量索引 1000 个文档,记住下面的事实: -- 1,000 documents at 1 KB each is 1 MB. -- 1,000 documents at 100 KB each is 100 MB. +- 1000 个 1 KB 大小的文档加起来是 1 MB 大。 +- 1000 个 100 KB 大小的文档加起来是 100 MB 大。 -Those are drastically different bulk sizes. Bulks need to be loaded into memory -at the coordinating node, so it is the physical size of the bulk that is more -important than the document count. +这可是完完全全不一样的批量大小了。批量请求需要在协调节点上加载进内存,所以批量请求的物理大小比文档计数重要得多。 -Start with a bulk size around 5–15 MB and slowly increase it until you do not -see performance gains anymore. Then start increasing the concurrency of your -bulk ingestion (multiple threads, and so forth). +从 5–15 MB 开始测试批量请求大小,缓慢增加这个数字,直到你看不到性能提升为止。然后开始增加你的批量写入的并发度(多线程等等办法)。 -Monitor your nodes with Marvel and/or tools such as `iostat`, `top`, and `ps` to see -when resources start to bottleneck. If you start to receive `EsRejectedExecutionException`, -your cluster can no longer keep up: at least one resource has reached capacity. Either reduce concurrency, provide more of the limited resource (such as switching from spinning disks to SSDs), or add more nodes. 
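One lightweight way to spot those rejections early is the `_cat` thread-pool endpoint, which reports active, queued, and rejected counts per node for the bulk, index, and search pools:

[source,js]
----
GET /_cat/thread_pool?v
----

A steadily climbing `rejected` count for the bulk pool is the clearest sign that ingestion has outrun the cluster's capacity.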
+用 Marvel 以及诸如 `iostat` 、 `top` 和 `ps` 等工具监控你的节点,观察资源什么时候达到瓶颈。如果你开始收到 `EsRejectedExecutionException` ,你的集群没办法再继续了:至少有一种资源到瓶颈了。或者减少并发数,或者提供更多的受限资源(比如从机械磁盘换成 SSD),或者添加更多节点。 [NOTE] ==== -When ingesting data, make sure bulk requests are round-robined across all your -data nodes. Do not send all requests to a single node, since that single node -will need to store all the bulks in memory while processing. +写数据的时候,要确保批量请求是轮询发往你的全部数据节点的。不要把所有请求都发给单个节点,因为这个节点会需要在处理的时候把所有批量请求都存在内存里。 ==== -==== Storage +==== 存储 -Disks are usually the bottleneck of any modern server. Elasticsearch heavily uses disks, and the more throughput your disks can handle, the more stable your nodes will be. Here are some tips for optimizing disk I/O: +磁盘在现代服务器上通常都是瓶颈。Elasticsearch 重度使用磁盘,你的磁盘能处理的吞吐量越大,你的节点就越稳定。这里有一些优化磁盘 I/O 的技巧: -- Use SSDs. As mentioned elsewhere, ((("storage")))((("indexing", "performance tips", "storage")))they are superior to spinning media. -- Use RAID 0. Striped RAID will increase disk I/O, at the obvious expense of -potential failure if a drive dies. Don't use mirrored or parity RAIDS since -replicas provide that functionality. -- Alternatively, use multiple drives and allow Elasticsearch to stripe data across -them via multiple `path.data` directories. -- Do not use remote-mounted storage, such as NFS or SMB/CIFS. The latency introduced -here is antithetical to performance. -- If you are on EC2, beware of EBS. Even the SSD-backed EBS options are often slower -than local instance storage. +- 使用 SSD。就像其他地方提过的,((("storage")))((("indexing", "performance tips", "storage")))他们比机械磁盘优秀多了。 +- 使用 RAID 0。条带化 RAID 会提高磁盘 I/O,代价显然就是当一块硬盘故障时整个就故障了。不要使用镜像或者奇偶校验 RAID 因为副本已经提供了这个功能。 +- 另外,使用多块硬盘,并允许 Elasticsearch 通过多个 `path.data` 目录配置把数据条带化分配到它们上面。 +- 不要使用远程挂载的存储,比如 NFS 或者 SMB/CIFS。这个引入的延迟对性能来说完全是背道而驰的。 +- 如果你用的是 EC2,当心 EBS。即便是基于 SSD 的 EBS,通常也比本地实例的存储要慢。 [[segments-and-merging]] -==== Segments and Merging +==== 段和合并 -Segment merging is computationally expensive,((("indexing", "performance tips", "segments and merging")))((("merging segments")))((("segments", "merging"))) and can eat up a lot of disk I/O. -Merges are scheduled to operate in the background because they can take a long -time to finish, especially large segments. This is normally fine, because the -rate of large segment merges is relatively rare. +段合并的计算量庞大,((("indexing", "performance tips", "segments and merging")))((("merging segments")))((("segments", "merging")))而且还要吃掉大量磁盘 I/O。合并在后台定期操作,因为他们可能要很长时间才能完成,尤其是比较大的段。这个通常来说都没问题,因为大规模段合并的概率是很小的。 -But sometimes merging falls behind the ingestion rate. If this happens, Elasticsearch -will automatically throttle indexing requests to a single thread. This prevents -a _segment explosion_ problem, in which hundreds of segments are generated before -they can be merged. Elasticsearch will log `INFO`-level messages stating `now -throttling indexing` when it detects merging falling behind indexing. +不过有时候合并会拖累写入速率。如果这个真的发生了,Elasticsearch 会自动限制索引请求到单个线程里。这个可以防止出现 _段爆炸_ 问题,即数以百计的段在被合并之前就生成出来。如果 Elasticsearch 发现合并拖累索引了,它会会记录一个声明有 `now throttling indexing` 的 `INFO` 级别信息。 -Elasticsearch defaults here are conservative: you don't want search performance -to be impacted by background merging. But sometimes (especially on SSD, or logging -scenarios), the throttle limit is too low. +Elasticsearch 默认设置在这块比较保守:不希望搜索性能被后台合并影响。不过有时候(尤其是 SSD,或者日志场景)限流阈值太低了。 -The default is 20 MB/s, which is a good setting for spinning disks. If you have -SSDs, you might consider increasing this to 100–200 MB/s. 
Test to see what works -for your system: +默认值是 20 MB/s,对机械磁盘应该是个不错的设置。如果你用的是 SSD,可以考虑提高到 100–200 MB/s。测试验证对你的系统哪个值合适: [source,js] ---- @@ -117,9 +70,7 @@ PUT /_cluster/settings } ---- -If you are doing a bulk import and don't care about search at all, you can disable -merge throttling entirely. This will allow indexing to run as fast as your -disks will allow: +如果你在做批量导入,完全不在意搜索,你可以彻底关掉合并限流。这样让你的索引速度跑到你磁盘允许的极限: [source,js] ---- @@ -130,58 +81,31 @@ PUT /_cluster/settings } } ---- -<1> Setting the throttle type to `none` disables merge throttling entirely. When -you are done importing, set it back to `merge` to reenable throttling. +<1> 设置限流类型为 `none` 彻底关闭合并限流。等你完成了导入,记得改回 `merge` 重新打开限流。 -If you are using spinning media instead of SSD, you need to add this to your -`elasticsearch.yml`: +如果你使用的是机械磁盘而非 SSD,你需要添加下面这个配置到你的 `elasticsearch.yml` 里: [source,yaml] ---- index.merge.scheduler.max_thread_count: 1 ---- -Spinning media has a harder time with concurrent I/O, so we need to decrease -the number of threads that can concurrently access the disk per index. This setting -will allow `max_thread_count + 2` threads to operate on the disk at one time, -so a setting of `1` will allow three threads. - -For SSDs, you can ignore this setting. The default is -`Math.min(3, Runtime.getRuntime().availableProcessors() / 2)`, which works well -for SSD. - -Finally, you can increase `index.translog.flush_threshold_size` from the default -512 MB to something larger, such as 1 GB. This allows larger segments to accumulate -in the translog before a flush occurs. By letting larger segments build, you -flush less often, and the larger segments merge less often. All of this adds up -to less disk I/O overhead and better indexing rates. Of course, you will need -the corresponding amount of heap memory free to accumulate the extra buffering -space, so keep that in mind when adjusting this setting. - -==== Other - -Finally, there are some other considerations to keep in mind: - -- If you don't need near real-time accuracy on your search results, consider -dropping the `index.refresh_interval` of((("indexing", "performance tips", "other considerations")))((("refresh_interval setting"))) each index to `30s`. If you are doing -a large import, you can disable refreshes by setting this value to `-1` for the -duration of the import. Don't forget to reenable it when you are finished! - -- If you are doing a large bulk import, consider disabling replicas by setting -`index.number_of_replicas: 0`.((("replicas, disabling during large bulk imports"))) When documents are replicated, the entire document -is sent to the replica node and the indexing process is repeated verbatim. This -means each replica will perform the analysis, indexing, and potentially merging -process. 
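Both settings are dynamic, so they can be dropped just for the duration of the import and restored afterwards. A sketch, with `my_index` as a placeholder name:

[source,js]
----
PUT /my_index/_settings
{
    "refresh_interval": "-1",
    "number_of_replicas": 0
}
----

and once the import has finished:

[source,js]
----
PUT /my_index/_settings
{
    "refresh_interval": "1s",
    "number_of_replicas": 1
}
----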
+机械磁盘在并发 I/O 支持方面比较差,所以我们需要降低每个索引并发访问磁盘的线程数。这个设置允许 `max_thread_count + 2` 个线程同时进行磁盘操作,也就是设置为 `1` 允许三个线程。 + +对于 SSD,你可以忽略这个设置,默认是 `Math.min(3, Runtime.getRuntime().availableProcessors() / 2)` ,对 SSD 来说运行的很好。 + +最后,你可以增加 `index.translog.flush_threshold_size` 设置,从默认的 512 MB 到更大一些的值,比如 1 GB。这可以在一次清空触发的时候在事务日志里积累出更大的段。而通过构建更大的段,清空的频率变低,大段合并的频率也变低。这一切合起来导致更少的磁盘 I/O 开销和更好的索引速率。当然,你会需要对应量级的 heap 内存用以积累更大的缓冲空间,调整这个设置的时候请记住这点。 + +==== 其他 + +最后,还有一些其他值得考虑的东西需要记住: + +- 如果你的搜索结果不需要近实时的准确度,考虑把每个索引的 `index.refresh_interval`((("indexing", "performance tips", "other considerations")))((("refresh_interval setting")))改到 `30s` 。如果你是在做大批量导入,导入期间你可以通过设置这个值为 `-1` 关掉刷新。别忘记在完工的时候重新开启它。 + +- 如果你在做大批量导入,考虑通过设置 `index.number_of_replicas: 0`((("replicas, disabling during large bulk imports")))关闭副本。文档在复制的时候,整个文档内容都被发往副本节点,然后逐字的把索引过程重复一遍。这意味着每个副本也会执行分析、索引以及可能的合并过程。 + -In contrast, if you index with zero replicas and then enable replicas when ingestion -is finished, the recovery process is essentially a byte-for-byte network transfer. -This is much more efficient than duplicating the indexing process. - -- If you don't have a natural ID for each document, use Elasticsearch's auto-ID -functionality.((("id", "auto-ID functionality of Elasticsearch"))) It is optimized to avoid version lookups, since the autogenerated -ID is unique. - -- If you are using your own ID, try to pick an ID that is http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html[friendly to Lucene]. ((("UUIDs (universally unique identifiers)"))) Examples include zero-padded -sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential -patterns that compress well. In contrast, IDs such as UUID-4 are essentially -random, which offer poor compression and slow down Lucene. +相反,如果你的索引是零副本,然后在写入完成后再开启副本,恢复过程本质上只是一个字节到字节的网络传输。相比重复索引过程,这个算是相当高效的了。 + +- 如果你没有给每个文档自带 ID,使用 Elasticsearch 的自动 ID 功能。((("id", "auto-ID functionality of Elasticsearch")))这个为避免版本查找做了优化,因为自动生成的 ID 是唯一的。 + +- 如果你在使用自己的 ID,尝试使用一种 http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html[Lucene 友好的] ID。((("UUIDs (universally unique identifiers)")))包括零填充序列 ID、UUID-1 和纳秒;这些 ID 都是有一致的,压缩良好的序列模式。相反的,像 UUID-4 这样的 ID,本质上是随机的,压缩比很低,会明显拖慢 Lucene。 From 1a1c25a15fcdbe87c46b0251f00526c61c6a5a40 Mon Sep 17 00:00:00 2001 From: Rex Date: Sat, 22 Oct 2016 14:29:20 +0800 Subject: [PATCH 88/95] chapter38_part1:/320_Geohashes/40_Geohashes.asciidoc (#305) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * chapter38_part1:/320_Geohashes/40_Geohashes.asciidoc * 按review修改 * 按review修改 * 添加空格 --- 320_Geohashes/40_Geohashes.asciidoc | 40 ++++++++--------------------- 1 file changed, 10 insertions(+), 30 deletions(-) diff --git a/320_Geohashes/40_Geohashes.asciidoc b/320_Geohashes/40_Geohashes.asciidoc index e756ab090..35bdbb820 100644 --- a/320_Geohashes/40_Geohashes.asciidoc +++ b/320_Geohashes/40_Geohashes.asciidoc @@ -1,34 +1,15 @@ [[geohashes]] == Geohashes -http://en.wikipedia.org/wiki/Geohash[Geohashes] are a way of encoding -`lat/lon` points as strings.((("geohashes")))((("latitude/longitude pairs", "encoding lat/lon points as strings with geohashes")))((("strings", "geohash"))) The original intention was to have a -URL-friendly way of specifying geolocations, but geohashes have turned out to -be a useful way of indexing geo-points and geo-shapes in databases. - -Geohashes divide the world into a grid of 32 cells--4 rows and 8 columns--each represented by a letter or number. 
The `g` cell covers half of -Greenland, all of Iceland, and most of Great Britian. Each cell can be further -divided into another 32 cells, which can be divided into another 32 cells, -and so on. The `gc` cell covers Ireland and England, `gcp` covers most of -London and part of Southern England, and `gcpuuz94k` is the entrance to -Buckingham Palace, accurate to about 5 meters. - -In other words, the longer the geohash string, the more accurate it is. If -two geohashes share a prefix— and `gcpuuz`—then it implies that -they are near each other. The longer the shared prefix, the closer they -are. - -That said, two locations that are right next to each other may have completely -different geohashes. For instance, the -http://en.wikipedia.org/wiki/Millennium_Dome[Millenium Dome] in London has -geohash `u10hbp`, because it falls into the `u` cell, the next top-level cell -to the east of the `g` cell. - -Geo-points can index their associated geohashes automatically, but more -important, they can also index all geohash _prefixes_. Indexing the location -of the entrance to Buckingham Palace--latitude `51.501568` and longitude -`-0.141257`—would index all of the geohashes listed in the following table, -along with the approximate dimensions of each geohash cell: +http://en.wikipedia.org/wiki/Geohash[Geohashes] 是一种将经纬度坐标( `lat/lon` )编码成字符串的方式。((("geohashes")))((("latitude/longitude pairs", "encoding lat/lon points as strings with geohashes")))((("strings", "geohash")))这么做的初衷只是为了让地理位置在 url 上呈现的形式更加友好,但现在 geohashes 已经变成一种在数据库中有效索引地理坐标点和地理形状的方式。 + +Geohashes 把整个世界分为 32 个单元的格子 —— 4 行 8 列 —— 每一个格子都用一个字母或者数字标识。比如 `g` 这个单元覆盖了半个格林兰,冰岛的全部和大不列颠的大部分。每一个单元还可以进一步被分解成新的 32 个单元,这些单元又可以继续被分解成 32 个更小的单元,不断重复下去。 `gc` 这个单元覆盖了爱尔兰和英格兰, `gcp` 覆盖了伦敦的大部分和部分南英格兰, `gcpuuz94k` 是白金汉宫的入口,精确到约 5 米。 + +换句话说, geohash 的长度越长,它的精度就越高。如果两个 geohashes 有一个共同的前缀— `gcpuuz`—就表示他们挨得很近。共同的前缀越长,距离就越近。 + +这也意味着,两个刚好相邻的位置,可能会有完全不同的 geohash 。比如,伦敦 http://en.wikipedia.org/wiki/Millennium_Dome[Millenium Dome] 的 geohash 是 `u10hbp` ,因为它落在了 `u` 这个单元里,而紧挨着它东边的最大的单元是 `g` 。 + +地理坐标点可以自动索引相关的 geohashes ,更重要的是,他们也可以索引所有的 geohashes _前缀_ 。如索引白金汉宫入口位置——纬度 `51.501568` ,经度 `-0.141257`—将会索引下面表格中列出的所有 geohashes ,表格中也给出了各个 geohash 单元的近似尺寸: [cols="1m,1m,3d",options="header"] |============================================= @@ -47,6 +28,5 @@ along with the approximate dimensions of each geohash cell: |gcpuuz94kkp5 |12 | ~ 3.7cm x 1.8cm |============================================= -The {ref}/query-dsl-geohash-cell-query.html[`geohash_cell` filter] can use -these geohash prefixes((("geohash_cell filter")))((("filters", "geohash_cell"))) to find locations near a specified `lat/lon` point. 
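A sketch of what such a filter looks like; it assumes an index whose `location` field is mapped as a `geo_point` with `geohash_prefix` enabled, the index and type names are placeholders, and the coordinates approximate the Buckingham Palace example above:

[source,js]
--------------------------------------------------
GET /attractions/restaurant/_search
{
    "query": {
        "filtered": {
            "filter": {
                "geohash_cell": {
                    "location": {
                        "lat":  51.50,
                        "lon":  -0.14
                    },
                    "precision": "1km" <1>
                }
            }
        }
    }
}
--------------------------------------------------
<1> The point is translated into the geohash cell of roughly this size; add `"neighbors": true` to match the surrounding cells as well.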
+{ref}/query-dsl-geohash-cell-query.html[`geohash单元` 过滤器] 可以使用这些 geohash 前缀((("geohash_cell filter")))((("filters", "geohash_cell")))来找出与指定坐标点( `lat/lon` )相邻的位置。 From 175866d61b75f842ffff4682d451dcb730960e8c Mon Sep 17 00:00:00 2001 From: weiqiangyuan Date: Sat, 22 Oct 2016 14:32:37 +0800 Subject: [PATCH 89/95] chapter43_part4: /404_Parent_Child/55_Has_parent.asciidoc (#277) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 完成Parent-Child第4部分的翻译 * revise according comment * revise according comment * save --- 404_Parent_Child/55_Has_parent.asciidoc | 27 ++++++++----------------- 1 file changed, 8 insertions(+), 19 deletions(-) diff --git a/404_Parent_Child/55_Has_parent.asciidoc b/404_Parent_Child/55_Has_parent.asciidoc index fcc37e404..d616e2d17 100644 --- a/404_Parent_Child/55_Has_parent.asciidoc +++ b/404_Parent_Child/55_Has_parent.asciidoc @@ -1,14 +1,9 @@ [[has-parent]] -=== Finding Children by Their Parents +=== 通过父文档查询子文档 -While a `nested` query can always ((("parent-child relationship", "finding children by their parents")))return only the root document as a result, -parent and child documents are independent and each can be queried -independently. The `has_child` query allows us to return parents based on -data in their children, and the `has_parent` query returns children based on -data in their parents.((("has_parent query and filter", "query"))) +虽然 `nested` 查询只能返回最顶层的文档 ((("parent-child relationship", "finding children by their parents"))),但是父文档和子文档本身是彼此独立并且可被单独查询的。我们使用 `has_child` 语句可以基于子文档来查询父文档,使用 `has_parent` 语句可以基于子文档来查询父文档。 ((("has_parent query and filter", "query"))) -It looks very similar to the `has_child` query. This example returns -employees who work in the UK: +`has_parent` 和 `has_child` 非常相似,下面的查询将会返回所有在 UK 工作的雇员: [source,json] ------------------------- @@ -26,19 +21,13 @@ GET /company/employee/_search } } ------------------------- -<1> Returns children who have parents of type `branch` +<1> 返回父文档 `type` 是 `branch` 的所有子文档 -The `has_parent` query also supports the `score_mode`,((("score_mode parameter"))) but it accepts only two -settings: `none` (the default) and `score`. Each child can have only one -parent, so there is no need to reduce multiple scores into a single score for -the child. The choice is simply between using the score (`score`) or not -(`none`). +`has_parent` 查询也支持 `score_mode` 这个参数,((("score_mode parameter")))但是该参数只支持两种值: `none` (默认)和 `score` 。每个子文档都只有一个父文档,因此这里不存在将多个评分规约为一个的情况, `score_mode` 的取值仅为 `score` 和 `none` 。 -.Non-scoring has_parent Query +.不带评分的 has_parent 查询 ************************** -When used in non-scoring mode (e.g. inside a `filter` clause), the `has_parent` -query no longer supports the `score_mode` parameter. Because it is merely -including/excluding documents and not scoring, the `score_mode` parameter -no longer applies. 
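As an illustration, the UK-branch query above can be wrapped in filter context, where it matches the same documents but calculates no scores:

[source,js]
--------------------------------------------------
GET /company/employee/_search
{
    "query": {
        "filtered": {
            "filter": {
                "has_parent": {
                    "type": "branch",
                    "query": {
                        "match": { "country": "UK" }
                    }
                }
            }
        }
    }
}
--------------------------------------------------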
+当 `has_parent` 查询用于非评分模式(比如 filter 查询语句)时, `score_mode` 参数就不再起作用了。因为这种模式只是简单地包含或排除文档,没有评分,那么 `score_mode` 参数也就没有意义了。 + ************************** From e42e492e9c1d1611975d7ddb14bfd1b7d77b7c26 Mon Sep 17 00:00:00 2001 From: weiqiangyuan Date: Sat, 22 Oct 2016 14:37:27 +0800 Subject: [PATCH 90/95] chapter43_part1: /404_Parent_Child/40_Parent_child.asciidoc (#275) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 完成Parent-Child第1部分的翻译 * revise nested model name * reisve pull request according to first inner team review * reisve pull request according to first inner team review * revise docvalue and maps * revise 1 to 一 --- 404_Parent_Child/40_Parent_child.asciidoc | 54 ++++++----------------- 1 file changed, 13 insertions(+), 41 deletions(-) diff --git a/404_Parent_Child/40_Parent_child.asciidoc b/404_Parent_Child/40_Parent_child.asciidoc index 15b64a23e..87cc431f1 100644 --- a/404_Parent_Child/40_Parent_child.asciidoc +++ b/404_Parent_Child/40_Parent_child.asciidoc @@ -1,54 +1,26 @@ [[parent-child]] -== Parent-Child Relationship +== 父-子关系文档 -The _parent-child_ relationship is ((("relationships", "parent-child")))((("parent-child relationship")))similar in nature to the -<>: both allow you to associate one entity -with another. ((("nested objects", "parent-child relationships versus")))The difference is that, with nested objects, all entities live -within the same document while, with parent-child, the parent and children -are completely separate documents. +父-子关系文档 ((("relationships", "parent-child"))) ((("parent-child relationship"))) 在实质上类似于 <> :允许将一个对象实体和另外一个对象实体关联起来。((("nested objects", "parent-child relationships versus")))而这两种类型的主要区别是:在 <> 文档中,所有对象都是在同一个文档中,而在父-子关系文档中,父对象和子对象都是完全独立的文档。 -The parent-child functionality allows you to associate one document type with -another, in a _one-to-many_ relationship--one parent to many children.((("one-to-many relationships"))) The -advantages that parent-child has over <> are as follows: +父-子关系的主要作用是允许把一个 type 的文档和另外一个 type 的文档关联起来,构成一对多的关系:一个父文档可以对应多个子文档 ((("one-to-many relationships"))) 。与 <> 相比,父-子关系的主要优势有: -* The parent document can be updated without reindexing the children. +* 更新父文档时,不会重新索引子文档。 +* 创建,修改或删除子文档时,不会影响父文档或其他子文档。这一点在这种场景下尤其有用:子文档数量较多,并且子文档创建和修改的频率高时。 +* 子文档可以作为搜索结果独立返回。 -* Child documents can be added, changed, or deleted without affecting either - the parent or other children. This is especially useful when child documents - are large in number and need to be added or changed frequently. - -* Child documents can be returned as the results of a search request. - -Elasticsearch maintains a map of which parents are associated with -which children. It is thanks to this map that query-time joins are fast, but -it does place a limitation on the parent-child relationship: _the parent -document and all of its children must live on the same shard_. - -The parent-child ID maps are stored in <>, which allows them to execute -quickly when fully hot in memory, but scalable enough to spill to disk when -the map is very large. 
+Elasticsearch 维护了一个父文档和子文档的映射关系,得益于这个映射,父-子文档关联查询操作非常快。但是这个映射也对父-子文档关系有个限制条件:父文档和其所有子文档,都必须要存储在同一个分片中。 +父-子文档ID映射存储在 <> 中。当映射完全在内存中时, <> 提供对映射的快速处理能力,另一方面当映射非常大时,可以通过溢出到磁盘提供足够的扩展能力 [[parent-child-mapping]] -=== Parent-Child Mapping +=== 父-子关系文档映射 -All that is needed in order to establish the parent-child relationship is to -specify which document type should be the parent of a child type.((("mapping (types)", "parent-child")))((("parent-child relationship", "parent-child mapping"))) This must -be done at index creation time, or with the `update-mapping` API before the -child type has been created. +建立父-子文档映射关系时只需要指定某一个文档 type 是另一个文档 type 的父亲。 ((("mapping (types)", "parent-child"))) ((("parent-child relationship", "parent-child mapping"))) 该关系可以在如下两个时间点设置:1)创建索引时;2)在子文档 type 创建之前更新父文档的 mapping。 -As an example, let's say that we have a company that has branches in many -cities. We would like to associate employees with the branch where they work. -We need to be able to search for branches, individual employees, and employees -who work for particular branches, so the nested model will not help. We -could, of course, -use <> or -<> here instead, but for demonstration -purposes we will use parent-child. +举例说明,有一个公司在多个城市有分公司,并且每一个分公司下面都有很多员工。有这样的需求:按照分公司、员工的维度去搜索,并且把员工和他们工作的分公司联系起来。针对该需求,用嵌套模型是无法实现的。当然,如果使用 <> 或者 <> 也是可以实现的,但是为了演示的目的,在这里我们使用父-子文档。 -All that we have to do is to tell Elasticsearch that the `employee` type has -the `branch` document type as its `_parent`, which we can do when we create -the index: +我们需要告诉Elasticsearch,在创建员工 `employee` 文档 type 时,指定分公司 `branch` 的文档 type 为其父亲。 [source,json] ------------------------- @@ -64,4 +36,4 @@ PUT /company } } ------------------------- -<1> Documents of type `employee` are children of type `branch`. +<1> `employee` 文档 是 `branch` 文档的子文档。 From 6e640484f30fe1134a5e31fdf5dc6ea975cb8aac Mon Sep 17 00:00:00 2001 From: weiqiangyuan Date: Sat, 22 Oct 2016 14:40:41 +0800 Subject: [PATCH 91/95] chapter43_part3: /404_Parent_Child/50_Has_child.asciidoc (#276) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 完成Parent-Child第3部分的翻译 * revise according comment * revise according comment * revise according comment * revise according comment --- 404_Parent_Child/50_Has_child.asciidoc | 47 +++++++------------------- 1 file changed, 13 insertions(+), 34 deletions(-) diff --git a/404_Parent_Child/50_Has_child.asciidoc b/404_Parent_Child/50_Has_child.asciidoc index b448b202a..eb01e36f6 100644 --- a/404_Parent_Child/50_Has_child.asciidoc +++ b/404_Parent_Child/50_Has_child.asciidoc @@ -1,10 +1,7 @@ [[has-child]] -=== Finding Parents by Their Children - -The `has_child` query and filter can be used to find parent documents based on -the contents of their children.((("has_child query and filter")))((("parent-child relationship", "finding parents by their children"))) For instance, we could find all branches that -have employees born after 1980 with a query like this: +=== 通过子文档查询父文档 +`has_child` 的查询和过滤可以通过子文档的内容来查询父文档。((("has_child query and filter")))((("parent-child relationship", "finding parents by their children")))例如,我们根据如下查询,可查出所有80后员工所在的分公司: [source,json] ------------------------- GET /company/branch/_search @@ -24,16 +21,10 @@ GET /company/branch/_search } ------------------------- -Like the <>, the `has_child` query could -match several child documents,((("has_child query and filter", "query"))) each with a different relevance -score. 
How these scores are reduced to a single score for the parent document -depends on the `score_mode` parameter. The default setting is `none`, which -ignores the child scores and assigns a score of `1.0` to the parents, but it -also accepts `avg`, `min`, `max`, and `sum`. +类似于 <> ,`has_child` 查询可以匹配多个子文档((("has_child query and filter", "query"))),并且每一个子文档的评分都不同。但是由于每一个子文档都带有评分,这些评分如何规约成父文档的总得分取决于 `score_mode` 这个参数。该参数有多种取值策略:默认为 `none` ,会忽略子文档的评分,并且会给父文档评分设置为 `1.0` ; +除此以外还可以设置成 `avg` 、 `min` 、 `max` 和 `sum` 。 -The following query will return both `london` and `liverpool`, but `london` -will get a better score because `Alice Smith` is a better match than -`Barry Smith`: +下面的查询将会同时返回 `london` 和 `liverpool` ,不过由于 `Alice Smith` 要比 `Barry Smith` 更加匹配查询条件,因此 `london` 会得到一个更高的评分。 [source,json] ------------------------- @@ -53,19 +44,14 @@ GET /company/branch/_search } ------------------------- -TIP: The default `score_mode` of `none` is significantly faster than the other -modes because Elasticsearch doesn't need to calculate the score for each child -document. Set it to `avg`, `min`, `max`, or `sum` only if you care about the -score.((("parent-child relationship", "finding parents by their children", "min_children and max_children"))) +TIP: `score_mode` 为默认的 `none` 时,会显著地比其模式要快,这是因为Elasticsearch不需要计算每一个子文档的评分。只有当你真正需要关心评分结果时,才需要为 `source_mode` 设值,例如设成 `avg` 、 `min` 、 `max` 或 `sum` 。((("parent-child relationship", "finding parents by their children", "min_children and max_children"))) [[min-max-children]] -==== min_children and max_children +==== min_children 和 max_children -The `has_child` query and filter both accept the `min_children` and -`max_children` parameters,((("min_children parameter")))((("max_children parameter")))((("has_child query and filter", "min_children or max_children parameters"))) which will return the parent document only if the -number of matching children is within the specified range. +`has_child` 的查询和过滤都可以接受这两个参数:`min_children` 和 `max_children` 。 ((("min_children parameter")))((("max_children parameter")))((("has_child query and filter", "min_children or max_children parameters"))) 使用这两个参数时,只有当子文档数量在指定范围内时,才会返回父文档。 -This query will match only branches that have at least two employees: +如下查询只会返回至少有两个雇员的分公司: [source,json] ------------------------- @@ -82,21 +68,14 @@ GET /company/branch/_search } } ------------------------- -<1> A branch must have at least two employees in order to match. +<1> 至少有两个雇员的分公司才会符合查询条件。 -The performance of a `has_child` query or filter with the `min_children` or -`max_children` parameters is much the same as a `has_child` query with scoring -enabled. +带有 `min_children` 和 `max_children` 参数的 `has_child` 查询或过滤,和允许评分的 `has_child` 查询的性能非常接近。 .has_child Filter ************************** -The `has_child` filter works((("has_child query and filter", "filter"))) in the same way as the `has_child` query, except -that it doesn't support the `score_mode` parameter. It can be used only in -_filter context_—such as inside a `filtered` query--and behaves -like any other filter: it includes or excludes, but doesn't score. - -While the results of a `has_child` filter are not cached, the usual caching -rules apply to the filter _inside_ the `has_child` filter. 
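For example, the earlier ``employees born after 1980'' search can be expressed as a filter like this (a sketch only):

[source,js]
--------------------------------------------------
GET /company/branch/_search
{
    "query": {
        "filtered": {
            "filter": {
                "has_child": {
                    "type": "employee",
                    "query": {
                        "range": { "dob": { "gte": "1980-01-01" } }
                    }
                }
            }
        }
    }
}
--------------------------------------------------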
+`has_child` 查询和过滤在运行机制上类似,((("has_child query and filter", "filter")))区别是 `has_child` 过滤不支持 `source_mode` 参数。`has_child` 过滤仅用于筛选内容--如内部的一个 `filtered` 查询--和其他过滤行为类似:包含或者排除,但没有进行评分。 +`has_child` 过滤的结果没有被缓存,但是 `has_child` 过滤内部的过滤方法适用于通常的缓存规则。 ************************** From 712085a8c8de4b9fd32f70eace6a2ee40cc44c51 Mon Sep 17 00:00:00 2001 From: weiqiangyuan Date: Sat, 22 Oct 2016 14:41:58 +0800 Subject: [PATCH 92/95] chapter43_part5: /404_Parent_Child/60_Children_agg.asciidoc (#278) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * 完成Parent-Child第5部分的翻译 * revise according comment --- 404_Parent_Child/60_Children_agg.asciidoc | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/404_Parent_Child/60_Children_agg.asciidoc b/404_Parent_Child/60_Children_agg.asciidoc index 6af80f0ec..0363fa2e6 100644 --- a/404_Parent_Child/60_Children_agg.asciidoc +++ b/404_Parent_Child/60_Children_agg.asciidoc @@ -1,14 +1,10 @@ [[children-agg]] -=== Children Aggregation +=== 子文档聚合 -Parent-child supports a -http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html[`children` aggregation] as ((("aggregations", "children aggregation")))((("children aggregation")))((("parent-child relationship", "children aggregation")))a direct analog to the `nested` aggregation discussed in -<>. A parent aggregation (the equivalent of -`reverse_nested`) is not supported. - -This example demonstrates how we could determine the favorite hobbies of our -employees by country: +在父-子文档中支持 +http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html[子文档聚合],这一点和((("aggregations", "children aggregation")))((("children aggregation")))((("parent-child relationship", "children aggregation"))) <> 类似。但是,对于父文档的聚合查询是不支持的(和 `reverse_nested` 类似)。 +我们通过下面的例子来演示按照国家维度查看最受雇员欢迎的业余爱好: [source,json] ------------------------- GET /company/branch/_search @@ -37,7 +33,6 @@ GET /company/branch/_search } } ------------------------- -<1> The `country` field in the `branch` documents. -<2> The `children` aggregation joins the parent documents with - their associated children of type `employee`. -<3> The `hobby` field from the `employee` child documents. 
+<1> `country` 是 `branch` 文档的一个字段。 +<2> 子文档聚合查询通过 `employee` type 的子文档将其父文档聚合在一起。 +<3> `hobby` 是 `employee` 子文档的一个字段。 From 95ad256e38d9cd16e3d815bcd0737f740df79e1b Mon Sep 17 00:00:00 2001 From: Medcl Date: Sat, 22 Oct 2016 21:27:46 +0800 Subject: [PATCH 93/95] fix filename (#325) --- 070_Index_Mgmt/10_Settings.asciidoc | 3 +-- 130_Partial_Matching/05_Postcodes.asciidoc | 1 + 510_Deployment/45_dont_touch.asciidoc | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/070_Index_Mgmt/10_Settings.asciidoc b/070_Index_Mgmt/10_Settings.asciidoc index 88da77ec6..439f7ff5f 100644 --- a/070_Index_Mgmt/10_Settings.asciidoc +++ b/070_Index_Mgmt/10_Settings.asciidoc @@ -1,3 +1,4 @@ +[[index-settings]] === 索引设置 你可以通过修改配置来((("index settings")))自定义索引行为,详细配置参照 @@ -40,5 +41,3 @@ PUT /my_temp_index/_settings } -------------------------------------------------- // SENSE: 070_Index_Mgmt/10_Settings.json - - diff --git a/130_Partial_Matching/05_Postcodes.asciidoc b/130_Partial_Matching/05_Postcodes.asciidoc index 18b27e22f..0b22b17e9 100644 --- a/130_Partial_Matching/05_Postcodes.asciidoc +++ b/130_Partial_Matching/05_Postcodes.asciidoc @@ -1,3 +1,4 @@ +[[postcodes-and-structured-data]] === 邮编与结构化数据 我们会使用美国目前使用的邮编形式(United Kingdom postcodes 标准)来说明如何用部分匹配查询结构化数据。((("partial matching", "postcodes and structured data")))这种邮编形式有很好的结构定义。例如,邮编 `W1V 3DG` 可以分解成如下形式:((("postcodes (UK), partial matching with"))) diff --git a/510_Deployment/45_dont_touch.asciidoc b/510_Deployment/45_dont_touch.asciidoc index 45721760f..ec78f4bd4 100644 --- a/510_Deployment/45_dont_touch.asciidoc +++ b/510_Deployment/45_dont_touch.asciidoc @@ -1,4 +1,4 @@ - +[[dont-touch-these-settings]] === 不要触碰这些配置! 在 Elasticsearch 中有一些热点,人们可能不可避免的会碰到。((("deployment", "settings to leave unaltered"))) 我们理解的,所有的调整就是为了优化,但是这些调整,你真的不需要理会它。因为它们经常会被乱用,从而造成系统的不稳定或者糟糕的性能,甚至两者都有可能。 From 774df158e59528f650c37808bcd028d0e20b8c98 Mon Sep 17 00:00:00 2001 From: "feng.wei" Date: Sat, 22 Oct 2016 22:32:59 +0800 Subject: [PATCH 94/95] 10_Multi --- 050_Search/10_Multi_index_multi_type.asciidoc | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/050_Search/10_Multi_index_multi_type.asciidoc b/050_Search/10_Multi_index_multi_type.asciidoc index 0b7a0f378..77fff07c4 100644 --- a/050_Search/10_Multi_index_multi_type.asciidoc +++ b/050_Search/10_Multi_index_multi_type.asciidoc @@ -1,11 +1,11 @@ [[multi-index-multi-type]] === 多索引,多类型 -你有没有注意到之前的 <> 的结果包含从两个不同索引下 — `us` and `gb` 的不同类型 `user` and `tweet` 的文档? +你有没有注意到之前的 <> 的结果,不同类型的文档((("searching", "multi-index, multi-type search")))— `user` 和 `tweet` 来自不同的索引— `us` 和 `gb` ? 
-如果不对某一特殊的索引或者类型做限制性的搜索,就会搜索集群中的所有文档。Elasticsearch 转发搜索请求到每一个主分片或者副本分片,汇集查询出的前10个结果,并且返回给我们。 +如果不对某一特殊的索引或者类型做限制,就会搜索集群中的所有文档。Elasticsearch 转发搜索请求到每一个主分片或者副本分片,汇集查询出的前10个结果,并且返回给我们。 -然而,经常的情况下,你想在一个或多个特殊的索引并且在一个或者多个特殊的类型中进行搜索。我们可以通过在URL中指定特殊的索引和类型达到这种效果,如下所示: +然而,经常的情况下,你((("types", "specifying in search requests")))(((" indices", "specifying in search requests")))想在一个或多个特殊的索引并且在一个或者多个特殊的类型中进行搜索。我们可以通过在URL中指定特殊的索引和类型达到这种效果,如下所示: `/_search`:: @@ -35,8 +35,8 @@ [TIP] ================================================ -搜索一个有五个主分片的索引和搜索只有一个主分片的五个索引准确来所说是等价的。 +搜索一个索引有五个主分片和搜索五个索引各有一个分片准确来所说是等价的。 ================================================ -最后,你将明白这种简单的方式如何弹性的把请求的变化变得简单化。 +接下来,你将明白这种简单的方式如何弹性的把请求的变化变得简单化。 From 6acb4e434a437514c52b76896067b406eea78e6f Mon Sep 17 00:00:00 2001 From: Medcl Date: Sat, 22 Oct 2016 23:17:37 +0800 Subject: [PATCH 95/95] Revert "chapter05_part2:/050_Search/10_Multi_index_multi_type.asciidoc" (#326) --- 050_Search/05_Empty_search.asciidoc | 54 +++++++++--- 050_Search/10_Multi_index_multi_type.asciidoc | 40 +++++---- 050_Search/15_Pagination.asciidoc | 41 +++++++--- 050_Search/20_Query_string.asciidoc | 82 +++++++++++++------ 4 files changed, 155 insertions(+), 62 deletions(-) diff --git a/050_Search/05_Empty_search.asciidoc b/050_Search/05_Empty_search.asciidoc index a12f507d1..25cb69a86 100644 --- a/050_Search/05_Empty_search.asciidoc +++ b/050_Search/05_Empty_search.asciidoc @@ -1,14 +1,17 @@ [[empty-search]] === The Empty Search -搜索API的最基础的形式是没有指定任何查询的空搜索,它简单地返回集群中所有目录中的所有文档: +The most basic form of the((("searching", "empty search")))((("empty search"))) search API is the _empty search_, which doesn't +specify any query but simply returns all documents in all indices in the +cluster: [source,js] -------------------------------------------------- GET /_search -------------------------------------------------- +// SENSE: 050_Search/05_Empty_search.json -返回的结果(为了解决编辑过的)像这种这样子: +The response (edited for brevity) looks something like this: [source,js] -------------------------------------------------- @@ -45,39 +48,66 @@ GET /_search ==== hits -返回结果中最重的部分是 `hits` ,它包含与我们查询相匹配的文档总数 `total` ,并且一个 `hits` 数组包含所查询结果的前十个文档。 +The most important section of the response is `hits`, which((("searching", "empty search", "hits")))((("hits"))) contains the +`total` number of documents that matched our query, and a `hits` array +containing the first 10 of those matching documents--the results. -在 `hits` 数组中每个结果包含文档的 `_index` 、 `_type` 、 `_id` ,加上 `_source` 字段。这意味着我们可以直接从返回的搜索结果中使用整个文档。这不像其他的搜索引擎,仅仅返回文档的ID,获取对应的文档需要在单独的步骤。 +Each result in the `hits` array contains the `_index`, `_type`, and `_id` of +the document, plus the `_source` field. This means that the whole document is +immediately available to us directly from the search results. This is unlike +other search engines, which return just the document ID, requiring you to fetch +the document itself in a separate step. -每个结果还有一个 `_score` ,这是衡量文档与查询匹配度的关联性分数。默认情况下,首先返回最相关的文档结果,就是说,返回的文档是按照 `_score` 降序排列的。在这个例子中,我们没有指定任何查询,故所有的文档具有相同的相关性,因此对所有的结果而言 `1` 是中性的 `_score` 。 +Each element also ((("score", "for empty search")))((("relevance scores")))has a `_score`. This is the _relevance score_, which is a +measure of how well the document matches the query. By default, results are +returned with the most relevant documents first; that is, in descending order +of `_score`. In this case, we didn't specify any query, so all documents are +equally relevant, hence the neutral `_score` of `1` for all results. 
-`max_score` 值是与查询所匹配文档的最高 `_score` 。 +The `max_score` value is the highest `_score` of any document that matches our +query.((("max_score value"))) ==== took -`took` 值告诉我们执行整个搜索请求耗费了多少毫秒。 +The `took` value((("took value (empty search)"))) tells us how many milliseconds the entire search request took +to execute. ==== shards -`_shards` 部分告诉我们在查询中参与分片的总数,以及这些分片成功了多少个失败了多少个。正常情况下我们不希望分片失败,但是分片失败是可能发生的。如果我们遭遇到一种较常见的灾难,在这个灾难中丢失了相同分片的原始数据和副本,那么对这个分片将没有可用副本来对搜索请求作出响应。假若这样,Elasticsearch 将报告这个分片是失败的,但是会继续返回剩余分片的结果。 +The `_shards` element((("shards", "number involved in an empty search"))) tells us the `total` number of shards that were involved +in the query and,((("failed shards (in a search)")))((("successful shards (in a search)"))) of them, how many were `successful` and how many `failed`. +We wouldn't normally expect shards to fail, but it can happen. If we were to +suffer a major disaster in which we lost both the primary and the replica copy +of the same shard, there would be no copies of that shard available to respond +to search requests. In this case, Elasticsearch would report the shard as +`failed`, but continue to return results from the remaining shards. ==== timeout -`timed_out` 值告诉我们查询是否超时。默认情况下,搜索请求不会超时。如果低响应时间比完成结果更重要,你可以指定 `timeout` 为10或者10ms(10毫秒),或者1s(1秒): +The `timed_out` value tells((("timed_out value in search results"))) us whether the query timed out. By +default, search requests do not time out.((("timeout parameter", "specifying in a request"))) If low response times are more +important to you than complete results, you can specify a `timeout` as `10` +or `10ms` (10 milliseconds), or `1s` (1 second): [source,js] -------------------------------------------------- GET /_search?timeout=10ms -------------------------------------------------- -在请求超时之前,Elasticsearch 将返回从每个分片聚集来的结果。 + +Elasticsearch will return any results that it has managed to gather from +each shard before the requests timed out. [WARNING] ================================================ -应当注意的是 `timeout` 不是停止执行查询,它仅仅是告知正在协调的节点返回到目前为止收集的结果并且关闭连接。在后台,其他的分片可能仍在执行查询即使是结果已经被发送了。 +It should be noted that this `timeout` does not((("timeout parameter", "not halting query execution"))) halt the execution of the +query; it merely tells the coordinating node to return the results collected +_so far_ and to close the connection. In the background, other shards may +still be processing the query even though results have been sent. -使用超时是因为对你的SLA是重要的,不是因为想去中止长时间运行的查询。 +Use the time-out because it is important to your SLA, not because you want +to abort the execution of long-running queries. ================================================ diff --git a/050_Search/10_Multi_index_multi_type.asciidoc b/050_Search/10_Multi_index_multi_type.asciidoc index 77fff07c4..d865bff0d 100644 --- a/050_Search/10_Multi_index_multi_type.asciidoc +++ b/050_Search/10_Multi_index_multi_type.asciidoc @@ -1,42 +1,54 @@ [[multi-index-multi-type]] -=== 多索引,多类型 +=== Multi-index, Multitype -你有没有注意到之前的 <> 的结果,不同类型的文档((("searching", "multi-index, multi-type search")))— `user` 和 `tweet` 来自不同的索引— `us` 和 `gb` ? +Did you notice that the results from the preceding <> +contained documents ((("searching", "multi-index, multi-type search")))of different types—`user` and `tweet`—from two +different indices—`us` and `gb`? -如果不对某一特殊的索引或者类型做限制,就会搜索集群中的所有文档。Elasticsearch 转发搜索请求到每一个主分片或者副本分片,汇集查询出的前10个结果,并且返回给我们。 +By not limiting our search to a particular index or type, we have searched +across _all_ documents in the cluster. 
Elasticsearch forwarded the search +request in parallel to a primary or replica of every shard in the cluster, +gathered the results to select the overall top 10, and returned them to us. -然而,经常的情况下,你((("types", "specifying in search requests")))(((" indices", "specifying in search requests")))想在一个或多个特殊的索引并且在一个或者多个特殊的类型中进行搜索。我们可以通过在URL中指定特殊的索引和类型达到这种效果,如下所示: +Usually, however, you will((("types", "specifying in search requests")))((("indices", "specifying in search requests"))) want to search within one or more specific indices, +and probably one or more specific types. We can do this by specifying the +index and type in the URL, as follows: `/_search`:: - 在所有的索引中搜索所有的类型 + Search all types in all indices `/gb/_search`:: - 在 `gb` 索引中搜索所有的类型 + Search all types in the `gb` index `/gb,us/_search`:: - 在 `gb` 和 `us` 索引中搜索所有的文档 + Search all types in the `gb` and `us` indices `/g*,u*/_search`:: - 在任何以 `g` 或者 `u` 开头的索引中搜索所有的类型 + Search all types in any indices beginning with `g` or beginning with `u` `/gb/user/_search`:: - 在 `gb` 索引中搜索 `user` 类型 + Search type `user` in the `gb` index `/gb,us/user,tweet/_search`:: - 在 `gb` 和 `us` 索引中搜索 `user` 和 `tweet` 类型 + Search types `user` and `tweet` in the `gb` and `us` indices `/_all/user,tweet/_search`:: - 在所有的索引中搜索 `user` 和 `tweet` 类型 + Search types `user` and `tweet` in all indices -当在单一的索引下进行搜索的时候,Elasticsearch 转发请求到索引的每个分片中,可以是主分片也可以是副本分片,然后从每个分片中收集结果。多索引搜索恰好也是用相同的方式工作的--只是会涉及到更多的分片。 +When you search within a single index, Elasticsearch forwards the search +request to a primary or replica of every shard in that index, and then gathers the +results from each shard. Searching within multiple indices works in exactly +the same way--there are just more shards involved. [TIP] ================================================ -搜索一个索引有五个主分片和搜索五个索引各有一个分片准确来所说是等价的。 +Searching one index that has five primary shards is _exactly equivalent_ to +searching five indices that have one primary shard each. ================================================ -接下来,你将明白这种简单的方式如何弹性的把请求的变化变得简单化。 +Later, you will see how this simple fact makes it easy to scale flexibly +as your requirements change. diff --git a/050_Search/15_Pagination.asciidoc b/050_Search/15_Pagination.asciidoc index 8a8511bae..6123cf73b 100644 --- a/050_Search/15_Pagination.asciidoc +++ b/050_Search/15_Pagination.asciidoc @@ -1,17 +1,21 @@ [[pagination]] -=== 分页 +=== Pagination -在之前的 <> 中知道集群中有14个文档匹配了我们(empty)query。但是在 `hits` 数组中只有10个文档,怎么样我们才能看到其他的文档呢? +Our preceding <> told us that 14 documents in the((("pagination"))) +cluster match our (empty) query. But there were only 10 documents in +the `hits` array. How can we see the other documents? 
-像SQL使用 `LIMIT` 关键字返回单页的结果一样,Elasticsearch 有 `from` 和 `size` 参数: +In the same way as SQL uses the `LIMIT` keyword to return a single ``page'' of +results, Elasticsearch accepts ((("from parameter")))((("size parameter")))the `from` and `size` parameters: `size`:: - 显示应该返回的结果数量,默认是 `10` + Indicates the number of results that should be returned, defaults to `10` `from`:: - 显示应该跳过的初始结果数量,默认是 `0` + Indicates the number of initial results that should be skipped, defaults to `0` -如果想每页展示五条结果,可以用下面三种方式请求: +If you wanted to show five results per page, then pages 1 to 3 +could be requested as follows: [source,js] -------------------------------------------------- @@ -22,17 +26,30 @@ GET /_search?size=5&from=10 // SENSE: 050_Search/15_Pagination.json -考虑到分页太深或者请求太多结果的情况,在返回结果之前可以对结果排序。但是请记住一个请求经常跨越多个分片,每个分片都产生自己的排序结果,这些结果需要进行集中排序以保证全部的次序是正确的。 +Beware of paging too deep or requesting too many results at once. Results are +sorted before being returned. But remember that a search request usually spans +multiple shards. Each shard generates its own sorted results, which then need +to be sorted centrally to ensure that the overall order is correct. -.在分布式系统中深度分页 +.Deep Paging in Distributed Systems **** -理解问什么深度分页是有问题的,我们可以想象搜索有五个主分片的单一索引。当我们请求结果的第一页(结果从1到10),每一个分片产生前10的结果,并且返回给起协调作用的节点,起协调作用的节点在对50个结果排序得到全部结果的前10个。 +To understand why ((("deep paging, problems with")))deep paging is problematic, let's imagine that we are +searching within a single index with five primary shards. When we request the +first page of results (results 1 to 10), each shard produces its own top 10 +results and returns them to the _coordinating node_, which then sorts all 50 +results in order to select the overall top 10. -现在想象我们请求第1000页--结果从10001到10010。所有都以相同的方式工作除了每个分片不得不产生前10010个结果以外。然后起协调作用的节点对全部50050个结果排序最后丢弃掉这些结果中的50040个结果。 +Now imagine that we ask for page 1,000--results 10,001 to 10,010. Everything +works in the same way except that each shard has to produce its top 10,010 +results. The coordinating node then sorts through all 50,050 results and +discards 50,040 of them! -看得出来,在分布式系统中,对结果排序的成本随分页的深度成指数上升。这就是为什么每次查询不要返回超过1000个结果的一个好理由。 +You can see that, in a distributed system, the cost of sorting results +grows exponentially the deeper we page. There is a good reason +that web search engines don't return more than 1,000 results for any query. **** -TIP: 在 <> 中我们解释了如何有效的获取大量的文档。 +TIP: In <> we explain how you _can_ retrieve large numbers of +documents efficiently. diff --git a/050_Search/20_Query_string.asciidoc b/050_Search/20_Query_string.asciidoc index 813de7a1d..f4340dab8 100644 --- a/050_Search/20_Query_string.asciidoc +++ b/050_Search/20_Query_string.asciidoc @@ -1,9 +1,14 @@ [[search-lite]] === Search _Lite_ -有两种搜索API的形式:一种精简查询-字符串版本在查询字符串中传递所有的参数,另一种功能全面的_request body_版本使用JSON格式并且使用一种名叫查询DSL的丰富搜索语言。 +There are two forms of the `search` API: a ``lite'' _query-string_ version +that expects all its((("searching", "query string searches")))((("query strings", "searching with"))) parameters to be passed in the query string, and the full +_request body_ version that expects a JSON request body and uses a +rich search language called the query DSL. -在命令行中查询-字符串搜索对运行特殊的查询是有益的。例如,查询在 `tweet` 类型中 `tweet` 字段包含 `elasticsearch` 单词的所有文档: +The query-string search is useful for running ad hoc queries from the +command line. 
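+When run from an actual shell, such a request is simply wrapped in `curl`.
+The following is only a sketch and assumes a node is reachable on
+`localhost:9200`; the console-style examples that follow omit the host for
+brevity:
+
+[source,js]
+--------------------------------------------------
+curl -XGET 'localhost:9200/_search?q=mary'
+--------------------------------------------------
+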
For instance, this query finds all documents of type `tweet` that +contain the word `elasticsearch` in the `tweet` field: [source,js] -------------------------------------------------- @@ -11,11 +16,13 @@ GET /_all/tweet/_search?q=tweet:elasticsearch -------------------------------------------------- // SENSE: 050_Search/20_Query_string.json -下一个查询在 `name` 字段中包含 `john` 并且在 `tweet` 字段中包含 `mary` 的文档。实际的查询就是这样 +The next query looks for `john` in the `name` field and `mary` in the +`tweet` field. The actual query is just +name:john +tweet:mary -但是查询-字符串参数所需要的百分比编码让它比实际上的更含义模糊: +but the _percent encoding_ needed for query-string parameters makes it appear +more cryptic than it really is: [source,js] -------------------------------------------------- @@ -24,12 +31,15 @@ GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary // SENSE: 050_Search/20_Query_string.json -`+` 前缀表示必须与查询条件匹配。类似地, `-` 前缀表示一定不与查询条件匹配。没有 `+` 或者 `-` 的所有条件是可选的--匹配的越多,文档就越相关。 +The `+` prefix indicates conditions that _must_ be satisfied for our query to +match. Similarly a `-` prefix would indicate conditions that _must not_ +match. All conditions without a `+` or `-` are optional--the more that match, +the more relevant the document. [[all-field-intro]] ==== The _all Field -这个简单搜索返回包含 `mary` 的所有文档: +This simple search returns all documents that contain the word `mary`: [source,js] -------------------------------------------------- @@ -38,15 +48,19 @@ GET /_search?q=mary // SENSE: 050_Search/20_All_field.json -之前的例子中,我们在 `tweet` 和 `name` 字段中搜索内容。然而,这个查询的结果在三个地方提到了 `mary` : +In the previous examples, we searched for words in the `tweet` or +`name` fields. However, the results from this query mention `mary` in +three fields: * A user whose name is Mary * Six tweets by Mary * One tweet directed at @mary -Elasticsearch 是如何在三个不同的区域中查找到结果的呢? +How has Elasticsearch managed to find results in three different fields? -当你索引一个文档的时候,Elasticsearch 取出所有字段的值拼接成一个大的字符串,作为 `_all` 字段进行索引。例如,当我们索引这个文档时: +When you index a document, Elasticsearch takes the string values of all of +its fields and concatenates them into one big string, which it indexes as +the special `_all` field.((("_all field", sortas="all field"))) For example, when we index this document: [source,js] -------------------------------------------------- @@ -59,7 +73,7 @@ Elasticsearch 是如何在三个不同的区域中查找到结果的呢? -------------------------------------------------- -这就好似增加了一个名叫 `_all` 的额外字段: +it's as if we had added an extra field called `_all` with this value: [source,js] -------------------------------------------------- @@ -67,19 +81,24 @@ Elasticsearch 是如何在三个不同的区域中查找到结果的呢? -------------------------------------------------- -除非字段已经被指定,否则就使用 `_all` 字段进行搜索。 +The query-string search uses the `_all` field unless another +field name has been specified. -TIP: 在你刚开始使用 Elasticsearch 的时候, `_all` 字段是一个很实用的特征。之后,你会发现如果你在搜索的时候用指定字段来代替 `_all` 字段,对搜索出来的结果将有更好的控制。当 `_all` 字段对你不再有用的时候,你可以将它置为失效,向在 <> 中解释的。 +TIP: The `_all` field is a useful feature while you are getting started with +a new application. Later, you will find that you have more control over +your search results if you query specific fields instead of the `_all` +field. When the `_all` field is no longer useful to you, you can +disable it, as explained in <>. 
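+
+For reference, disabling it is a mapping-level change. A minimal sketch,
+using a hypothetical `my_index` index and `my_type` type, might look
+something like this:
+
+[source,js]
+--------------------------------------------------
+PUT /my_index/_mapping/my_type
+{
+    "my_type": {
+        "_all": { "enabled": false }
+    }
+}
+--------------------------------------------------
+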
[[query-string-query]] [role="pagebreak-before"] -==== 更复杂的查询 +==== More Complicated Queries -下面对tweents的查询,使用以下的条件: +The next query searches for tweets, using the following criteria: -* `name` 字段中包含 `mary` 或者 `john` -* `date` 值大于 `2014-09-10` -* +_all_+ 字段包含 `aggregations` 或者 `geo` +* The `name` field contains `mary` or `john` +* The `date` is greater than `2014-09-10` +* The +_all+ field contains either of the words `aggregations` or `geo` [source,js] -------------------------------------------------- @@ -87,24 +106,39 @@ TIP: 在你刚开始使用 Elasticsearch 的时候, `_all` 字段是一个很 -------------------------------------------------- // SENSE: 050_Search/20_All_field.json -适当编码过的查询字符串看起来有点晦涩难读: +As a properly encoded query string, this looks like the slightly less +readable result: [source,js] -------------------------------------------------- ?q=%2Bname%3A(mary+john)+%2Bdate%3A%3E2014-09-10+%2B(aggregations+geo) -------------------------------------------------- -从之前的例子中可以看出,这种简化的查询-字符串的效果是非常惊人的。在相关参考文档中做出了详细解释的查询语法,让我们可以简洁的表达很复杂的查询。这对于命令行随机查询和在开发阶段都是很好的。 +As you can see from the preceding examples, this _lite_ query-string search is +surprisingly powerful.((("query strings", "syntax, reference for"))) Its query syntax, which is explained in detail in the +{ref}/query-dsl-query-string-query.html#query-string-syntax[Query String Syntax] +reference docs, allows us to express quite complex queries succinctly. This +makes it great for throwaway queries from the command line or during +development. -然而,这种简洁的方式可能让排错变得模糊和困难。像 `-` , `:` , `/` 或者 `"` 不匹配这种易错的小语法问题将返回一个错误。 +However, you can also see that its terseness can make it cryptic and +difficult to debug. And it's fragile--a slight syntax error in the query +string, such as a misplaced `-`, `:`, `/`, or `"`, and it will return an error +instead of results. -字符串查询允许任何用户在索引的任意字段上运行既慢又重的查询,这些查询可能会暴露隐私信息或者将你的集群拖垮。 +Finally, the query-string search allows any user to run potentially slow, heavy +queries on any field in your index, possibly exposing private information or +even bringing your cluster to its knees! [TIP] ================================================== -因为这些原因,我们不推荐直接向用户暴露查询-字符串,除非这些用户对于集群和数据是可以被信任的。 - +For these reasons, we don't recommend exposing query-string searches directly to +your users, unless they are power users who can be trusted with your data and +with your cluster. ================================================== -相反,我们经常在产品中更多的使用功能全面的 _request body_ 查询API。然而,在我们达到那种程度之前,我们首先需要了解数据在 Elasticsearch 中是如何索引的。 +Instead, in production we usually rely on the full-featured _request body_ +search API, which does all of this, plus a lot more. Before we get there, +though, we first need to take a look at how our data is indexed in +Elasticsearch.
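+
+As a small preview, the first query-string example in this section could be
+expressed with the request body API roughly like the following sketch (the
+`match` query used here is introduced in detail later):
+
+[source,js]
+--------------------------------------------------
+GET /_all/tweet/_search
+{
+    "query": {
+        "match": {
+            "tweet": "elasticsearch"
+        }
+    }
+}
+--------------------------------------------------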