Commit Graph

  • 2e5ba75192 Optimize summary cache when it is effectively disabled Ivan Skytte Jørgensen 2016-03-22 12:23:10 +01:00
  • 88317cfab6 fix hashbang properly. Manual merge of 136d23816c68862ffcae14b84aec8fb39dc1e9a1 Brian Rasmusson 2016-03-22 11:29:40 +01:00
  • 833d99f805 oom shutdown fix Brian Rasmusson 2016-03-21 22:51:12 +01:00
  • 07200ca3dd hash bang fix. detect more corruption. don't dump titledb and spiderdb at same time, seems to reduce corruption in rdbmem. Manual merge of 61ef806dea75b630fb01cc0521f96df69bb96b46 and 136b8842dbdc844537bc5c59c99806fd65ecff27 Brian Rasmusson 2016-03-21 21:08:04 +01:00
  • e2e45dfbf6 Merge branch 'master' of https://github.com/privacore/open-source-search-engine Brian Rasmusson 2016-03-21 20:26:39 +01:00
  • 45e61b18af UDP Server now causes gb shutdown after 200 consecutive replies could not be sent due to Out of Memory Brian Rasmusson 2016-03-21 20:26:19 +01:00
  • c1cdd4a160 Use plain memcpy() in RdbList::addRecord() Ivan Skytte Jørgensen 2016-03-21 17:08:37 +01:00
  • ed565cc095 Add robots.txt precedence test with robots.txt in reverse order (disallow -> allow) Ai Lin Chia 2016-03-21 02:05:24 +01:00
  • 7b5a85e109 Add placeholder for more robots.txt tests Ai Lin Chia 2016-03-21 01:07:05 +01:00
  • 7caaa20f47 Add more unit test for robots.txt Ai Lin Chia 2016-03-21 00:24:21 +01:00
  • 70a020c4a8 Add more test to cover robots.txt precedence Ai Lin Chia 2016-03-20 20:34:39 +01:00
  • ead8322860 Add more unit test for robots.txt Ai Lin Chia 2016-03-20 18:04:48 +01:00
  • fa148a9c4f after dump completes scan tree to ensure all nodes reference secondary mem ptr so they don't get their data overwritten. Manual merge of 8bc653c31c29acc6c38d8691a4c3fba34e5f413b and 56bde4c3ef90d53cf565de05be115b88a540358f Brian Rasmusson 2016-03-19 23:35:38 +01:00
  • 5d3cb15bce few rearrangements for merge of commit 8bc653c31c29acc6c38d8691a4c3fba34e5f413b to work Brian Rasmusson 2016-03-19 23:18:53 +01:00
  • a009ee7d1f if running ./gb start and another gb is already bound on the port then quickly exit(0) and have the bash keep alive loop exit the loop based on that return value. we can't use ./cleanexit file because it doesn't get removed and will mess up the main process that is running. Manual merge of 0caf3458505fbe2eb860bb2de52d3088a91b71f0 Brian Rasmusson 2016-03-19 22:49:15 +01:00
  • c94216101e rename log files in the gb main.cpp code not in the bash loop. do not rename the log file if failed to start gb because socket was already bound. prevents us from double starts moving the log file, which is annoying. Manual merge of 36fdbf2f5aeb8e8c8ce0569d73899ff7f84cc0f8 Brian Rasmusson 2016-03-19 22:46:12 +01:00
  • 05b543d195 use ./cleanexit file to ensure gb doesn't restart after a graceful exit in the bash keep alive loop. Manual merge of a2e8a3a1fd03cce094497bf29d3b24b4d569403d Brian Rasmusson 2016-03-19 22:40:04 +01:00
  • 480ed2755e make sure firstKeyInQueue is set properly from current key, so reset list ptr before doing that in RdbDump.cpp. Manual merge of 7396e57660799e50daca8f5795f022398268cad2 Brian Rasmusson 2016-03-19 22:14:28 +01:00
  • 0885815b6a Fix high-frequency term accidental reduction change of endkey Ivan Skytte Jørgensen 2016-03-19 21:11:23 +01:00
  • 13283e977e Removed out-commented code from RdbList.h Ivan Skytte Jørgensen 2016-03-19 21:07:27 +01:00
  • b628f19f28 more constness in RdbList Ivan Skytte Jørgensen 2016-03-19 21:05:15 +01:00
  • 7e78b3fa19 Removed unused macros from RdbList.cpp Ivan Skytte Jørgensen 2016-03-19 20:44:07 +01:00
  • 70e04d436e constness in RdbList Ivan Skytte Jørgensen 2016-03-19 20:40:24 +01:00
  • a82080c3d2 Removed some unused methods from RdbList Ivan Skytte Jørgensen 2016-03-19 20:19:19 +01:00
  • 827c90cc0a Changed enum initialization from weird bit operation to plain number Ivan Skytte Jørgensen 2016-03-19 18:29:12 +01:00
  • ab72e994d4 Removed redundant decl of operator new/delete Ivan Skytte Jørgensen 2016-03-19 18:12:59 +01:00
  • 956762ab93 Removed redundant decl of g_numUrgentMerges Ivan Skytte Jørgensen 2016-03-19 18:03:49 +01:00
  • 162d6f0a32 Handle truncated utf8 characters correctly in link texts Ivan Skytte Jørgensen 2016-03-19 15:18:32 +01:00
  • bb02ad47e0 added error log before calling sendErrorReply to aid in debugging of out-of-mem problem on a single host that caused all queries to fail. May be removed again later Brian Rasmusson 2016-03-18 23:44:28 +01:00
  • b88ed660c0 Moved local key comparison functions/macros from RdbList.h to RdbList.cpp Ivan Skytte Jørgensen 2016-03-18 22:24:59 +01:00
  • fcb3a5ff3f Make docIdSplits range configurable, defaulting to [5..15] Ivan Skytte Jørgensen 2016-03-18 22:15:11 +01:00
  • 6108f06e65 dump usetimeaxis when dumping titledb. forgotten commit from manual merges Brian Rasmusson 2016-03-18 21:19:58 +01:00
  • d76518d556 Add unit test for robots.txt. Disable failing unit test for now Ai Lin Chia 2016-03-18 17:46:01 +01:00
  • 8956ca00e5 Use plain memcpy() instead of gbmemcpy() in areas where overlaps are impossible Ivan Skytte Jørgensen 2016-03-18 16:00:29 +01:00
  • 5484267ede Log various summary-cache/loop/chunking/highfreq messages with query: prefix Ivan Skytte Jørgensen 2016-03-18 15:19:48 +01:00
  • 021f0288ae Add constness to Mime & Robots Ai Lin Chia 2016-03-18 10:43:56 +01:00
  • 02647dc862 Remove unused cacheStart & cacheLen from Robots::isAllowed Ai Lin Chia 2016-03-18 10:29:46 +01:00
  • ed840968ec Code style changes Ai Lin Chia 2016-03-18 10:17:35 +01:00
  • 10bf87c2ed Add some todo comments. Remove commented out code Ai Lin Chia 2016-03-18 10:04:43 +01:00
  • bca9d17fef Move ScalingFunctions unit test to test/unit Ai Lin Chia 2016-03-17 17:35:08 +01:00
  • 36439d2a41 Move BitOperations unit test to test/unit Ai Lin Chia 2016-03-17 17:23:41 +01:00
  • f453057c05 Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2016-03-17 22:42:19 +01:00
  • 103e76e246 if old title rec was corrupted we would get a random docid when re-spidering the url causing some chaos. now things should return to normal and we should overwrite the corrupted titlerec on the next spidering. also, no longer do robots.txt titlerec lookups. silly. Manual merge of 0b5f41734934c84eec89fb4e35691d20be7a8f78 Brian Rasmusson 2016-03-17 22:15:32 +01:00
  • 4af9d2a0f4 Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2016-03-17 21:54:47 +01:00
  • 67a63880ee do not allow crawlbot seeds to be deduped out. Manual merge of 58993dbbf99a27bcd9d4b5c27fda08355f261284 Brian Rasmusson 2016-03-17 21:36:58 +01:00
  • 7328852305 fix the source of lots of corruption in spiderdb and titledb. rdbmem.cpp was storing in secondary mem which got reset when dump completed. also do not add keys that are in collnum and key range of list currently being dumped, return ETRYAGAIN. added verify writes parm. clean out tree of titledb and spiderdb corruption on startup. Manual merge of 8a65d213715de57e3a4bf8d344b932f7b02acb09 Brian Rasmusson 2016-03-17 21:35:10 +01:00
  • b448bb6885 fix to allow us to gather ip-only url outlinks again. Manual merge of 0dbc304bbf33715b2aefd26700911a1d0201c507 Brian Rasmusson 2016-03-17 20:52:01 +01:00
  • 9fe07d5e2d print HTTP status on info page. Partial merge of 2c167aada7ec539f26f0c2aec66068199d5ca1a9 Brian Rasmusson 2016-03-17 20:50:22 +01:00
  • 13a9c75155 fix another core caused by deleted coll. manual merge of d6fe684b996a65203c0049a06f2994470286b96b Brian Rasmusson 2016-03-17 20:36:51 +01:00
  • 7f7aa4771b ignore meta redirect tags in html comment tags. Manual merge of e75d80abbe9a3e45a5c55cdecc7e9e3517c59c1d Brian Rasmusson 2016-03-17 20:30:30 +01:00
  • 0912f3242b fix neverending crawl rounds by only trying each url once per round. updated url filters. Manual merge of 412b04bbd4fb40cbe37cb613e5d28f495b711cc5 Brian Rasmusson 2016-03-17 20:05:39 +01:00
  • 15d143764d Removed redundant declarations of handleRequest25() Ivan Skytte Jørgensen 2016-03-17 19:56:27 +01:00
  • 6cdc4ab284 try to fix a couple more core dumps. Manual merge of da9949f462515ece4827c531003afdc645393567 Brian Rasmusson 2016-03-17 17:23:39 +01:00
  • 0087b3154c fix core from a federated query and null msg20. Manual merge of c7696a69eb0c2da031b63697b8736fee45ad1001 Brian Rasmusson 2016-03-17 17:15:56 +01:00
  • 7d5e515203 if spidered time is in future, consider the spiderreply corrupt and ignore it. if you set back the OS clock then you might end up ignoring some spider replies but hopefully it won't be such a big deal. Manual merge of f649944573dc0df9008fd23a657f235d7518aab9 Brian Rasmusson 2016-03-17 17:13:27 +01:00
  • 309aae0aa1 fix core dump from deleting an active/dumping collection. Manual merge of f11595efc3236a8c2f523bcb8014208f400b39bd Brian Rasmusson 2016-03-17 17:10:08 +01:00
  • cab0c721ff fix core in posdbtable from docid of 0. no idea why docid was 0, but why core. manual merge of e68406f073d57774f5a462a58ecd526790553a1b Brian Rasmusson 2016-03-17 16:55:06 +01:00
  • 774431992f Merge branch 'master' of https://github.com/privacore/open-source-search-engine Brian Rasmusson 2016-03-17 16:51:42 +01:00
  • a5217ae456 fix gap.com redirects that require us setting multiple cookies in the spider request + let's generalize it. if a redirect sets cookies then follow it through, don't stop in the middle because we think it is 'simplified'. Manual merge of d7a6a0a1fff182f91a06acea49ceae9a16fccc55 and e376b978147171ea6ef2929e086b56083c8475f0 Brian Rasmusson 2016-03-17 16:51:27 +01:00
  • 8deb36fd82 constness for findChar/findCharReverse/findQuoteChar/findEqualChar Ivan Skytte Jørgensen 2016-03-17 16:35:05 +01:00
  • 64d5d07e27 Reduce use of default argument values in functions that have overloads Ivan Skytte Jørgensen 2016-03-17 16:17:48 +01:00
  • adae3cbd24 Merge branch 'master' of https://github.com/privacore/open-source-search-engine Brian Rasmusson 2016-03-17 15:49:28 +01:00
  • 64408b275a bring back max doc len parms. index gbssIsContentTruncated field. fix 30-day wait for >= 3 errors. Manual (partial) merge of 1d2dfe14563faf4e03b1ef25e848d8237776f0eb Brian Rasmusson 2016-03-17 15:49:17 +01:00
  • 21e42779f2 Use gcc __builtin_... functions for low-level bit operations Ivan Skytte Jørgensen 2016-03-17 15:07:25 +01:00
  • 81b95f58ac Make unittest work again Ai Lin Chia 2016-03-17 14:22:27 +01:00
  • b1eb2b52c9 Move isAllowed2 from XmlDoc into Robots.cpp Ai Lin Chia 2016-03-17 13:07:52 +01:00
  • d7c3a90544 fix core from treating corrupted titlerecs as non-existent. Manual merge of cdb8a5f86a90848f1608ad652645250536d3f73c Brian Rasmusson 2016-03-17 14:18:11 +01:00
  • ce051e112b watch out for negative datasize spider requests in doledb when calling xmldoc::set4 so we don't core any more. Manual merge of c1a72213d75428d9ff401626621e07a9ab727c44 Brian Rasmusson 2016-03-17 14:15:24 +01:00
  • 5a5063a284 Merge branch 'master' of https://github.com/privacore/open-source-search-engine Brian Rasmusson 2016-03-17 14:11:50 +01:00
  • 7b1b760c28 Manual merge of fixes for handling of corrupted titledb recs. d183525db6bba931c51eb0096fe18c02e3e44be1 Brian Rasmusson 2016-03-17 14:11:43 +01:00
  • 19553087a1 Improve Query::dumpToLog() Ivan Skytte Jørgensen 2016-03-17 13:58:29 +01:00
  • 3ac81f3f55 We should not ignore robots.txt from archive.org Ai Lin Chia 2016-03-17 11:18:56 +01:00
  • 1f8836f296 fix invalid <base href=/> tag. Manual merge of 4eec8eb5b7f7c94d2ef6bd3c1b28c7172d1221cc Brian Rasmusson 2016-03-17 10:11:58 +01:00
  • 093fe55448 Fix various always-true/false conditions flagged by Flexelint Ivan Skytte Jørgensen 2016-03-17 01:15:57 +01:00
  • ef75882508 Rewrite unusual use of boolean values Ivan Skytte Jørgensen 2016-03-17 01:05:18 +01:00
  • decb18d094 Parenthesize expression-like macros Ivan Skytte Jørgensen 2016-03-17 00:55:14 +01:00
  • 96a1071478 Commented out unused local static s_quickTables Ivan Skytte Jørgensen 2016-03-17 00:48:30 +01:00
  • b2744f862e Removed unused local function emailSleepWrapper() Ivan Skytte Jørgensen 2016-03-17 00:47:35 +01:00
  • e925f359e1 Do SafeBuf::setBuf() correctly using sizeof() Ivan Skytte Jørgensen 2016-03-17 00:41:25 +01:00
  • d0422dffdd Use sizeof(File) instead of (almost-arbitrary) constant 210 for quick-buffer Ivan Skytte Jørgensen 2016-03-17 00:40:19 +01:00
  • ac237739fb Simplify hack-of-confusion loop in Posdb.cpp Ivan Skytte Jørgensen 2016-03-15 16:17:03 +01:00
  • efbb3ae969 More simple constness in Posdb.* Ivan Skytte Jørgensen 2016-03-15 14:35:51 +01:00
  • ca7482f6b3 Rdb.* constness (just a bit) Ivan Skytte Jørgensen 2016-03-15 14:04:30 +01:00
  • c961aa0920 Make limiting (min/max) more readable by using macros Ivan Skytte Jørgensen 2016-03-15 13:51:31 +01:00
  • 47260728d9 PosdbTable::m_msg2 is always non-NULL Ivan Skytte Jørgensen 2016-03-15 12:48:17 +01:00
  • 1f793773ea Various cleanup in Posdb.* Ivan Skytte Jørgensen 2016-03-15 12:43:51 +01:00
  • 60085859e1 Refactor whitelist preparation out into prepareWhiteListTable() Ivan Skytte Jørgensen 2016-03-15 12:18:06 +01:00
  • 98e7447ce9 Various cleanup in HashTableX.cpp Ivan Skytte Jørgensen 2016-03-14 23:22:13 +01:00
  • bb3aee3e94 constness in HashTable/HashTableX Ivan Skytte Jørgensen 2016-03-14 23:10:06 +01:00
  • 9ab1bbb21a const reference in HashTableT<>::getOccupiedSlotNum() Ivan Skytte Jørgensen 2016-03-14 22:36:33 +01:00
  • cea0793984 constness of printTermList() and getHashGroupString() Ivan Skytte Jørgensen 2016-03-14 22:30:21 +01:00
  • a10cebd564 Removed unused parameter 'collnum' to PosdbTable::init() Ivan Skytte Jørgensen 2016-03-14 22:17:46 +01:00
  • 9819f64b4b cleanup: unused variable assignments Ivan Skytte Jørgensen 2016-03-14 22:06:14 +01:00
  • 670eea4a51 constness for PosdbTable::getTermPairScoreForWindow() and PosdbTable::getTermPairScoreForAny() Ivan Skytte Jørgensen 2016-03-14 17:52:08 +01:00
  • aac566ac33 Fix indentation error Ivan Skytte Jørgensen 2016-03-14 17:42:30 +01:00
  • 7968720c74 Fix indentation error Ivan Skytte Jørgensen 2016-03-14 17:23:21 +01:00
  • f0721c6acd Added missing content types to avoid coredumps when logging e.g. gz URLs in XmlDoc Brian Rasmusson 2016-03-14 14:32:03 +01:00
  • 5afb0a085e Cleanup in Synonym.h (unused members, constness, etc) Ivan Skytte Jørgensen 2016-03-14 13:54:07 +01:00
  • c7ec4c87b1 Added Query::dumpToLog() method Ivan Skytte Jørgensen 2016-03-14 12:18:08 +01:00
  • 92a865212d Removed un-accessed static variables from main.cpp Ivan Skytte Jørgensen 2016-03-14 00:50:37 +01:00