Commit Graph

  • b33a60ca5e remove log spam. Zak Betz 2015-10-07 10:05:06 -06:00
  • 0f453d5cdf Merge branch 'ia-zak' into testing Matt 2015-10-07 10:02:38 -06:00
  • dec683387a Merge branch 'ia-zak' into testing Matt 2015-10-07 10:02:13 -06:00
  • 16db36252c Merge branch 'diffbot-testing' into testing Matt 2015-10-07 10:02:06 -06:00
  • 46315ac7a3 fix multiple title rec gbcapture date bug Matt 2015-10-07 09:30:38 -06:00
  • a3de262ebe Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-10-07 08:56:08 -06:00
  • 45744d74f3 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into warc-stream Zak Betz 2015-10-07 08:46:07 -06:00
  • a9410738ae fix permissions bug when creating directories, need to put in user/group execute bit. Matt 2015-10-07 08:26:27 -06:00
  • 4600ce0816 fix threads from freezing up just because pthread_create() had an error. need to return the thread stack. Matt 2015-10-07 07:50:44 -06:00
  • 836aa1756d fix threads from freezing up just because pthread_create_thread() had errors. need to return the thread stack to the linked list of thread stacks. Matt 2015-10-07 07:49:34 -06:00
  • 6d90ea2e5f try to launch threads even if none need cleanup. hopefully fixes thread freeze. Matt 2015-10-07 07:29:37 -06:00
  • cee5d8922a Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-10-05 18:48:06 -06:00
  • 5b605624ca fix core dump average": 3.677707812426755e+26, Matt 2015-10-05 18:47:11 -06:00
  • a77c9be5b8 Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-10-05 17:32:21 -06:00
  • df1c7f6e0f update qa.cpp syntax test to do &n=100 for gbssStatusCode:0 query Matt 2015-10-05 17:31:35 -06:00
  • 97b9c99bec correcting facet min=0 sam 2015-10-05 16:09:52 -07:00
  • 49ec9c99fd Don't restart all items when forcing a list of items into injector. Zak Betz 2015-10-05 15:33:09 -06:00
  • 9b785a1522 allow more than 2gb of mem to be allocated to hold resulting docids. Matt 2015-10-05 09:35:44 -07:00
  • 757a44b149 fix facets when doing > 1 split and first split termlist is empty. Matt 2015-10-05 10:05:08 -06:00
  • 21b71226a6 remove bad fix Matt 2015-10-05 09:33:02 -06:00
  • c947252fee Add gbcapturedate to individual doc's metadata when injecting warcs. Zak Betz 2015-10-04 01:53:54 -06:00
  • 39214a9dc6 Merge branch 'diffbot-testing' into testing Matt 2015-10-02 19:26:15 -06:00
  • 42cdd5b382 fix msg20 getsummary core Matt Wells 2015-10-02 12:34:08 -07:00
  • e4adc99c0c fix empty winner tree bug. try to improve rdbcache promotion logic for all caches. -O2 on spider.cpp. Matt Wells 2015-10-02 12:16:48 -07:00
  • 9daaa4d5af Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-10-01 19:37:12 -07:00
  • 9178d67b2f fix churn bug in winnerlistcache in spider.cpp so do not add the dolebuf list of spiderrequests back into the cache, but just modify the "jump" in the first 4 bytes of the cached record. because when we re-added it back to the cache it created too much churn and we'd lose cached records unnecessarily. Matt Wells 2015-10-01 19:35:34 -07:00
  • 6becb55a2b Stream warcs instead of downloading them and unzipping them on disk. Zak Betz 2015-09-30 22:25:59 -06:00
  • a8e3e4b269 if metadata is already in the old xmldoc::ptr_metadata then do not re-add it. Matt 2015-09-30 21:46:07 -06:00
  • e06ae06c23 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-09-30 15:27:07 -06:00
  • a31c7f5fc8 exiting msg Matt 2015-09-30 15:26:58 -06:00
  • 06aea41611 show spiderdb scan progress in spider queue for the collection Matt Wells 2015-09-30 13:38:22 -07:00
  • b97546f98c do not expand about:blank iframes. Matt Wells 2015-09-30 09:36:04 -07:00
  • cb4bbe8892 Merge branch 'ia' into ia-zak Matt 2015-09-30 07:58:31 -06:00
  • d4c677170f index metadata on EDOCUNCHANGED errors, and append new meta data to XmlDoc::ptr_metadata. Matt 2015-09-30 07:57:40 -06:00
  • 67fc339953 prevent out of mem core. actually trying to alloc more than 2GB for search result stuff. Matt 2015-09-26 21:34:07 -07:00
  • f0a2f86200 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-09-25 08:09:25 -07:00
  • 55993e58d5 fix cores on gi #0 Matt 2015-09-25 08:09:05 -07:00
  • 2721256c0d show ip port of bad host Matt Wells 2015-09-25 07:47:21 -07:00
  • 68a3679fae exit if can not load auth/internetarchive.yml file Matt 2015-09-25 08:33:25 -06:00
  • 83ac18fff4 Merge branch 'master' into testing Matt 2015-09-25 08:25:19 -06:00
  • e2fad81227 Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing Matt 2015-09-25 08:24:54 -06:00
  • 3ce6c7d941 Merge branch 'ia-zak' into testing Matt 2015-09-25 08:24:12 -06:00
  • 1e8f656d30 Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-25 08:23:42 -06:00
  • 93943a0cab some pages legitimately have no outlinks, no need to think they were banned. Matt Wells 2015-09-24 14:01:23 -07:00
  • 6454dad6bf Revert "ignore real root, just use seeds, to detect if banned." Matt Wells 2015-09-24 13:29:20 -07:00
  • cb60f68e72 ignore real root, just use seeds, to detect if banned. Matt 2015-09-24 14:15:29 -06:00
  • 260864b364 urgent fix for core dumps for some queries that have long termlists. Matt 2015-09-24 11:49:50 -06:00
  • 268b21d552 reduce log spam Matt Wells 2015-09-24 11:37:16 -06:00
  • 9be3f9310e fix annoying core dump for some queries in Posdb.cpp Matt 2015-09-24 11:34:02 -06:00
  • 8a0461b82f Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-09-24 09:10:37 -06:00
  • d92b153090 added 'verify writes' switch to track down data corruption Matt 2015-09-24 09:10:20 -06:00
  • faedea4a9f Fix repeating label. Zak Betz 2015-09-24 01:33:51 -06:00
  • ce29993951 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-24 01:08:56 -06:00
  • ae44f295e7 Fix some bad html in statsdb graph. Zak Betz 2015-09-24 01:08:40 -06:00
  • 025d7e6e30 fix statsdb plot graph breach Matt 2015-09-23 19:53:29 -06:00
  • 3dcaf414db report bytes saved to disk. if thread crashes try to dump core. Matt Wells 2015-09-23 15:40:30 -07:00
  • 98744889e2 do not core if no collrec for msg20 summary request Matt Wells 2015-09-23 14:39:13 -07:00
  • ba8ebc7794 Revert "data corruption fixes" Matt Wells 2015-09-23 14:38:17 -07:00
  • 27172945c7 data corruption fixes Matt Wells 2015-09-23 14:34:52 -07:00
  • 2fde3ac5bc call umask() to fix gb process umask so files created are group writable Matt 2015-09-22 12:23:33 -06:00
  • f9442ac5cd Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2015-09-21 16:46:36 -06:00
  • 16b6e44bd1 Show utf8 url in page results. Zak Betz 2015-09-21 16:44:40 -06:00
  • 100888d691 fix file/dir creation permissions bugs Matt 2015-09-21 12:44:41 -06:00
  • 74cde33a3a just use the user's umask val for all file/dir creation Matt 2015-09-21 11:33:38 -06:00
  • ce7b06fc4d all files made are now group writable. if you don't like that then you can make a special group and set the directory just group writable for that group using chmod g+s <dir>. Matt 2015-09-21 11:19:34 -06:00
  • eefbe95ce9 Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-21 10:13:29 -06:00
  • 786ba76d10 Merge branch 'ia-zak' into testing Matt 2015-09-21 10:12:58 -06:00
  • 5aaa08d81a Merge branch 'ia-zak' of github.com:gigablast/open-source-search-engine into ia-zak Matt 2015-09-21 10:12:38 -06:00
  • 55169be6fc Warc injector update. Zak Betz 2015-09-21 09:31:59 -06:00
  • 83190e3bbc Make punycoded urls printable. Zak Betz 2015-09-21 09:17:40 -06:00
  • 5635695666 oom prevention Matt 2015-09-20 21:42:46 -06:00
  • d6d5d10a15 prevent core from bad root title rec Matt Wells 2015-09-20 08:26:00 -07:00
  • 69a3cb0999 fix corrupt tag with corrupt root title buf from coring Matt Wells 2015-09-17 21:33:58 -07:00
  • 13e0ba7bff fix bug of having a meta redirect tag in <script> tags. we have to use Xml class to make sure it is a legit refresh tag. Matt 2015-09-16 11:03:38 -06:00
  • 58e9f56015 never let any diffbot error prevent us from retrying a url in subsequent crawl rounds. Matt 2015-09-16 10:00:11 -06:00
  • bcdecc63c6 expose "urlip" injection parm to provide ip of url being injected to save gigablast from an ip lookup if you want. Matt 2015-09-16 09:43:15 -06:00
  • d11761cfd9 update graph key Matt 2015-09-15 15:48:40 -06:00
  • b2f7e72d8a fix core from showing graph to two users Matt 2015-09-15 15:46:38 -06:00
  • 32d7f5cb97 better warc injection load balancing Matt 2015-09-15 15:04:26 -06:00
  • aada7d5e51 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-14 22:14:01 -06:00
  • 28ed7f66af Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-14 19:16:20 -06:00
  • f9c4f8fc9a test with 2 first Matt 2015-09-14 19:16:07 -06:00
  • fcd4fe3ff3 Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-14 19:04:15 -06:00
  • 16d90e60d1 allow up to 10 filter threads now Matt 2015-09-14 19:03:52 -06:00
  • 5caa219c71 Reduce false positives by not counting \0 as a non-ascii char in the url. Zak Betz 2015-09-14 12:24:50 -06:00
  • 5d724cdcc3 Check for spaces before non-ascii chars to reduce false positives. Also print the position of non-ascii char to aid debugging. We still need to handle utf8 chars in path. Zak Betz 2015-09-14 11:11:56 -06:00
  • 519b2c4f42 Fix repeating xn--xn-- when there are spaces in the domain. Make gb unittest take a name of the unit test to run. Zak Betz 2015-09-14 10:24:22 -06:00
  • 519017828c Enable punycode domains for testing. We still need to display them as utf8 on the front end. Zak Betz 2015-09-14 09:32:25 -06:00
  • 78125c809b Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2015-09-14 00:53:40 -06:00
  • 68a0d08820 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-14 00:35:22 -06:00
  • 5622ca47ee Work on non-ascii domain names. It works on correct inputs, but will crash on some non correct inputs, so it is forced to be disabled. Zak Betz 2015-09-14 00:34:44 -06:00
  • b4bac67fdf Merge branch 'diffbot-testing' into testing Matt 2015-09-13 19:40:57 -06:00
  • 710661a0f3 Merge branch 'ia-zak' into ia Matt 2015-09-13 19:40:21 -06:00
  • ffa465b942 Merge branch 'ia' of github.com:gigablast/open-source-search-engine into ia Matt 2015-09-13 19:40:17 -06:00
  • f1db3aca94 Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-13 19:39:50 -06:00
  • b8e4046d61 remove unnecessary line Matt Wells 2015-09-13 17:54:38 -07:00
  • 65613feb4c fix bug of not using part files when generating map Matt 2015-09-13 17:52:40 -07:00
  • cb6ca24c26 Allow nospider and noquery on the same host. Fix punycoding of non-ascii domains. Zak Betz 2015-09-13 17:15:31 -06:00
  • 3444c67851 exit faster. Matt 2015-09-13 14:28:35 -07:00
  • 6370054bd5 fix problem of adding too many collections and not wrapping the collnum_t id Matt Wells 2015-09-13 14:21:52 -07:00