Lately we ran into Solr stability issues in production, losing nodes out of our 3-node cluster.
We are running Solr 4.6 and ZooKeeper 3.5.8 on Linux CentOS machines, with about 15 Sitecore cores/collections configured on the Solr instances.
3-node cluster setup:
16 GB of RAM per server, JVM heap set at 4 GB
Java 1.7
We noticed that the sitecore_analytics_index core has a lot of records, which is not a big deal for Solr. However, we are using the standard out-of-the-box configuration. Is there anything we need to configure for the Sitecore cores to perform better? What are the best practices for Sitecore with Solr (if any)?
Also, Sitecore is running very heavy queries (we are not sure which core is doing this), which eat a lot of JVM resources. This causes our Solr instance to go down.
Some errors we are experiencing:
- ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: Broken pipe
- SolrCore [sitecore_marketingdefinitions_web] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
- ERROR java.lang.OutOfMemoryError: Requested array size exceeds VM limit
- ERROR SolrCmdDistributor org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
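The "Overlapping onDeckSearchers" and "exceeded limit of maxWarmingSearchers" errors both point at commits arriving faster than Solr can warm new searchers. A minimal solrconfig.xml sketch of the knobs involved (maxWarmingSearchers and the cache autowarm settings are standard Solr 4.x elements; the values here are illustrative, not a tested recommendation):

```xml
<query>
  <!-- How many searchers may warm concurrently; the errors above show this limit being hit -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
  <!-- Lower autowarmCount makes a new searcher ready faster after each commit -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
</query>
```

Raising maxWarmingSearchers mostly hides the symptom; reducing the commit frequency (see question 3 below) addresses the cause.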
Solutions applied:
We increased the memory footprint on all 3 Linux servers to 32 GB RAM and raised the JVM heap to 12 GB:
1. Log in on the Linux server as 'root'
2. sudo service tomcat stop
3. sudo kill -9 pid (tomcat) – only if Tomcat did not stop gracefully
4. Navigate to /opt/tomcat/bin
5. Create a new setenv.sh file
6. Edit the file and add (note the space before each trailing backslash, so the options are not concatenated into one token):

   export JAVA_OPTS="$JAVA_OPTS \
   -Xms12288m \
   -Xmx12288m \
   -XX:+HeapDumpOnOutOfMemoryError \
   -XX:HeapDumpPath=/var/log/ \
   -XX:MaxPermSize=1024m \
   -XX:MaxNewSize=2048m \
   -XX:NewSize=2048m"

7. Save the file
8. sudo service tomcat start
9. Navigate to solr1.environment.pmi.org:8080/solr
10. Ensure the Physical Memory and JVM-Memory values have changed.
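As a quick sanity check that the heap flags are what you think they are, the configured -Xmx can be grepped out of setenv.sh before restarting Tomcat. A small sketch (it writes a trimmed copy of the step 6 content to a temp file so it is self-contained; on the real server you would point SETENV at /opt/tomcat/bin/setenv.sh directly):

```shell
# Trimmed copy of the setenv.sh content from step 6 (illustration only;
# on the server, set SETENV=/opt/tomcat/bin/setenv.sh instead).
SETENV=$(mktemp)
cat > "$SETENV" <<'EOF'
export JAVA_OPTS="$JAVA_OPTS \
-Xms12288m \
-Xmx12288m \
-XX:+HeapDumpOnOutOfMemoryError"
EOF

# Extract the configured maximum heap size
XMX=$(grep -o '\-Xmx[0-9]*m' "$SETENV")
echo "configured max heap: $XMX"   # configured max heap: -Xmx12288m
rm -f "$SETENV"
```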
Solr was restored and remained stable for a few days, but the errors started creeping back and we saw the same behavior.
We captured the Catalina logs from the tomcat/logs folder and did an in-depth analysis.
We found that all 5 Sitecore web instances were configured to update the analytics index. Therefore Solr was getting bombarded with requests (and the resulting exceptions), and the sitecore_analytics_index was growing enormously (1 GB every week).
Some of the most common issues we noticed in the production logs:
1. Why is Sitecore retrieving the MAX rows available (rows=2147483647)? This is very costly for Solr.
454893853 [http-bio-8080-exec-4226]
INFO org.apache.solr.core.SolrCore – [sitecore_web_index]
webapp=/solr path=/select params={q=(slug_s:(\/certifications\/types\/pmp)+AND+_latestversion:(True))&fq=_indexname:(sitecore_web_index)&version=2.2&rows=2147483647}
hits=0 status=0 QTime=0
454877084 [http-bio-8080-exec-4197]
INFO org.apache.solr.core.SolrCore – [sitecore_web_index]
webapp=/solr path=/select
params={q=(slug_s:[*+TO+*]+AND+_group:(25e875fea5fa4ff3821a1413d4ee3bc1))&fq=_indexname:(sitecore_web_index)&version=2.2&rows=2147483647}
hits=2 status=0 QTime=7
454877096 [http-bio-8080-exec-4225]
INFO org.apache.solr.core.SolrCore – [sitecore_web_index]
webapp=/solr path=/select
params={q=(slug_s:[*+TO+*]+AND+_group:(25e875fea5fa4ff3821a1413d4ee3bc1))&fq=_indexname:(sitecore_web_index)&version=2.2&rows=2147483647}
hits=2 status=0 QTime=0
The code base that generates this query has been amended; it now requests only 1 row. Additionally, it will query more selectively to reduce the number of calls. The fix will be distributed in the next code delivery.
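For illustration, the difference is just the rows parameter on the request. A sketch of the amended call (hostname, core, and query are taken from the logs above; the real client code is Sitecore's, so this only shows the resulting query string):

```shell
SOLR="http://solr1.environment.pmi.org:8080/solr"
CORE="sitecore_web_index"
# The logged query asked for rows=2147483647; the amended code asks for a single row:
QUERY="${SOLR}/${CORE}/select?q=slug_s:(/certifications/types/pmp)+AND+_latestversion:(True)&fq=_indexname:(${CORE})&rows=1"
echo "$QUERY"
```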
2. Why is Sitecore retrieving all fields (fl=*,score, the equivalent of SELECT *) and also sorting on date fields? This is very costly for Solr.
454877030 [http-bio-8080-exec-4220]
INFO org.apache.solr.core.SolrCore – [sitecore_web_index]
webapp=/solr path=/select params={facet=true&sort=publicationdate_tdt+desc&fl=*,score&start=0&q=*:*&f.contenttype_facet_s.facet.mincount=1&facet.field=contentsources_facet_sm&facet.field=topicsfacet_sm&facet.field=contenttype_facet_s&fq=((((((((_templates:(51fe426158da421da104f3cbc23e328d)+AND+-_templates:(216357ebc69e46ceb912707c2dba28a1))+AND+_path:(110d559fdea542ea9c1c8a5df7e70ef9))+AND+-excludefromsearch_b:(True))+AND+_latestversion:(True))+AND+_language:(en))+AND+publicationdate_year_tl:[-2147483648+TO+2012])+AND+publicationdate_year_tl:[2005+TO+2147483647])+AND+((topicsfacet_sm:("Portfolio+Management")+AND+topicsfacet_sm:("Scope+Management"))+AND+topicsfacet_sm:("Program+Management")))&fq=_indexname:(sitecore_web_index)&version=2.2&f.topicsfacet_sm.facet.mincount=1&f.contentsources_facet_sm.facet.mincount=1&rows=10}
hits=2 status=0 QTime=5
The default behaviour of Sitecore is to retrieve all the fields of the document. In the sample above this type of call is made with rows=10, so it is not expensive. The sorting is necessary because this query feeds the KAS articles listing, and it is still cheaper to sort in Solr than to sort in memory in Sitecore. Alternatively, a new core could be set up with fewer fields.
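If a trimmed core is not an option, the same query could also be narrowed with an explicit fl list instead of fl=*,score. A hypothetical parameter set (the field names are guesses based on the logged query; the real list depends on what the KAS listing actually renders):

```shell
# Hypothetical narrowed field list -- fetch only what the listing renders
FL="_uniqueid,slug_s,publicationdate_tdt,score"
PARAMS="q=*:*&fq=_indexname:(sitecore_web_index)&fl=${FL}&sort=publicationdate_tdt+desc&rows=10"
echo "$PARAMS"
```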
3. Why is Sitecore doing hard commits directly on the index? Solr does not recommend this, because each commit opens a new searcher, and frequent commits can exceed the warming-searcher limit (as the error below shows).
54956494 [http-bio-8080-exec-4178]
INFO org.apache.solr.update.processor.LogUpdateProcessor –
[sitecore_analytics_index] webapp=/solr path=/update params={waitSearcher=true&commit=true&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2&softCommit=false}
{} 0 13566
454956495 [http-bio-8080-exec-4178]
ERROR org.apache.solr.core.SolrCore –
org.apache.solr.common.SolrException: Error opening new searcher. exceeded
limit of maxWarmingSearchers=2, try again later.
This is managed internally by Sitecore. We have added configuration fixes, as the logs reveal that all 5 servers are currently committing to sitecore_analytics_index. After the fix, only the CM server will do this.
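On the Solr side, commit pressure can also be reduced by letting the server schedule commits instead of honoring a hard commit from every client. A minimal solrconfig.xml sketch (autoCommit/autoSoftCommit are standard Solr 4.x settings; the intervals here are assumptions to be tuned):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to disk periodically without opening a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <maxDocs>10000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make new documents visible to searches every 15s -->
  <autoSoftCommit>
    <maxTime>15000</maxTime>
  </autoSoftCommit>
</updateHandler>
```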
4. Solr does not recommend optimizing the index on demand, especially for cores like sitecore_analytics_index.
455450829 [http-bio-8080-exec-4167]
INFO org.apache.solr.update.processor.LogUpdateProcessor –
[sitecore_analytics_index] webapp=/solr path=/update params={optimize=true&waitSearcher=true&maxSegments=1&wt=javabin&expungeDeletes=false&commit_end_point=true&version=2}
{optimize=} 0 155892
This happens on two occasions: automatically, when a full rebuild is executed, or via a Sitecore scheduled agent, which is currently triggered on sitecore_master_index.
sitecore_analytics_index indexes all visitor interactions; it is not populated from Sitecore content tree data. It looks like a lot of empty documents were produced in the index due to the 4x deletions at the time.
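Rather than explicit optimize calls (the log above shows one taking over 155 seconds), background segment merging can be left to reclaim the space from deleted documents. A hedged sketch of the relevant Solr 4.x index settings (TieredMergePolicy is the 4.x default; the values are illustrative):

```xml
<indexConfig>
  <!-- Background merging gradually rewrites segments, dropping deleted documents,
       without the full-index rewrite an explicit optimize performs -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexConfig>
```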