Commit Graph

  • 50d2cf9bc1 Removed obsolete private libiconv (Closes: #167) Dmitry Smirnov 2021-05-05 10:49:15 +10:00
  • c124eda914 cleanup: remove local zlib. All distros provide zlib1g-dev. Dmitry Smirnov 2021-05-05 10:26:56 +10:00
  • 3a55b74050 cleanup: removed useless local binaries (libgcc.a libc.a) Dmitry Smirnov 2021-05-05 10:24:45 +10:00
  • e18d2396a6 Removed private OpenSSL [hygiene,FTBFS]. All distros provide OpenSSL. Dmitry Smirnov 2021-05-05 10:20:11 +10:00
  • 7a2bca9649 Compile with "-std=c++98" to fix FTBFS (Closes: #164) Dmitry Smirnov 2021-05-05 10:07:55 +10:00
  • 9146f05574 Merge pull request #157 from shijuraj/master Gigablast 2020-05-04 08:55:25 -06:00
  • c1e5f9fd7f Merge pull request #1 from gigablast/master Shijuraj J 2019-09-21 08:24:58 +05:30
  • 4a943f1c79 Merge pull request #136 from vonbetz/master Gigablast 2017-06-02 11:32:56 -06:00
  • f10fdada73 Fix infinite loop on malformed proxy. Zak Betz 2017-06-02 11:28:58 -06:00
  • f9d607665b forgot comma. diffbot-testing Matt Wells 2017-02-24 10:01:26 -08:00
  • 6eb802054b fix some corruption in spider data after deleting a collection. Matt Wells 2017-02-24 09:59:28 -08:00
  • 88120414b0 import more stuck job fixes. Matt Wells 2016-11-16 10:59:55 -08:00
  • f1b6f73719 empty &seeds= fix to not reset seeds Matt Wells 2016-11-16 10:34:25 -08:00
  • 53ee1039b8 import final hung crawl fix into old gb. xor the firstip into the doledb key this time. seems to avoid all collisions now so we don't overwrite nodes in the doledb tree. Matt Wells 2016-11-07 09:11:27 -08:00
  • 3d248732d0 fix to shut up app checker. Matt 2016-11-04 17:28:26 -06:00
  • c0b2cdb60a hide the verify disk writes parm, seems to be causing cores when activated. and shouldn't really need to be used. is for debugging disk issues. Matt 2016-11-04 17:09:15 -06:00
  • 4b889e0ddd quick fix Matt Wells 2016-11-01 11:39:15 -07:00
  • 93d5752ab7 import fix for jobs hanging from pro. Matt Wells 2016-11-01 11:18:40 -07:00
  • 1542f7c57f do not save doledb on exit to prevent corruption being propagated and in case we change default spider priorities in the url filters code which could cause hung jobs. Matt Wells 2016-10-24 14:23:13 -07:00
  • d5bc775ce5 fix errCount bug in url filters from errCount wrapping over to negative numbers. Matt 2016-09-20 17:02:51 -06:00
  • f23daf3e5e more fixes for 'zeroing out' code. Matt Wells 2016-09-14 10:32:01 -07:00
  • 93f878fbca wow this thing is really being persnickety Matt Wells 2016-09-14 10:18:48 -07:00
  • 5d54fde09c more zerout fixes Matt Wells 2016-09-14 10:10:03 -07:00
  • 16ec2a6963 fix stack related bug. Matt Wells 2016-09-14 10:00:18 -07:00
  • 4ac26f9c9b rotate log file at 1GB. Matt Wells 2016-09-14 09:37:23 -07:00
  • 8a5891f2ec fix memory leak in Parms.cpp Matt 2016-09-12 13:41:07 -06:00
  • e9f0d067be zero out docs that we do not process to save disk space. record content hash in 'zeroed out' content so we preserve it for deduping. Matt Wells 2016-08-30 11:06:59 -07:00
  • 616bfeea86 show corrupt collection numbers for spiderdb corrupted recs. Matt Wells 2016-08-15 12:55:22 -07:00
  • 7e77b900a6 fix a core from corrupted spider request in doledb in rdb::reclaimMemFromDeletedTreeNodes() Matt Wells 2016-06-21 11:54:09 -07:00
  • b03571c1a1 fix infinite loop from corrupt spider request Matt Wells 2016-06-06 08:06:20 -07:00
  • 14e2f1f579 make trash subdir in case missing. Matt Wells 2016-05-30 19:37:03 -07:00
  • bd5618f1b7 fix spider request corruption detection in doledb Matt Wells 2016-05-28 08:56:36 -07:00
  • fe21fb1cc6 fix spider detection of corrupted requests. fix deduping so it doesn't core on docid based spider requests. Matt Wells 2016-05-28 08:40:59 -07:00
  • 7bd2344f41 increase regular page download timeout from 30 seconds to 60 seconds to accommodate some slower websites. Matt Wells 2016-05-18 10:05:02 -07:00
  • 494f7ca645 reduce diffbot timeout from 18000 secs to 240 secs Matt Wells 2016-05-18 09:55:15 -07:00
  • b0e015b97d fix dns lookup bug that was causing us to get incorrect ips sometimes. Matt Wells 2016-05-17 11:57:21 -07:00
  • 77dc78d122 fix empty file bug again Matt Wells 2016-05-13 14:49:54 -07:00
  • 39e621f655 trash files of length 0 that are holding up a merge. if we can't merge files we end up stockpiling them and things get slow fast. Matt Wells 2016-05-13 13:21:43 -07:00
  • dfca68ec46 Merge pull request #99 from vonbetz/regexdebug Gigablast 2016-05-13 10:06:24 -06:00
  • c89d963d46 Merge pull request #100 from vonbetz/tldfixes Gigablast 2016-05-13 10:01:01 -06:00
  • d4dc85bf18 Fixes for new tlds. They can now contain '-' and numbers. Fix punycode url encoding: set max length before encoding each url chunk. Zak Betz 2016-05-12 16:04:21 -06:00
  • 0074f2ec73 Merge branch 'diffbot-testing' of https://github.com/gigablast/open-source-search-engine into regexdebug Zak Betz 2016-05-11 14:07:41 -06:00
  • 0ee1e6164c Add input validation to regexs before crawlbot collections are created. Add ./gb egrep command to test regexes. Zak Betz 2016-05-11 14:03:45 -06:00
  • a246289238 Merge pull request #96 from vonbetz/diffbot-testing Gigablast 2016-05-11 09:32:21 -06:00
  • 89f3344be5 Updated tld list with the most current list. Zak Betz 2016-05-04 11:17:51 -06:00
  • 8c3eacc338 fix cores from XmlDoc::getLinkInfo1() returning -1 because of its call to getFirstIp() presumably. Matt Wells 2016-04-18 09:55:00 -07:00
  • c5de65a78a more core dump fixes concerning -1 being returned for XmlDoc::getLinkInfo1() Matt Wells 2016-04-17 18:50:23 -07:00
  • 65856e3b6a tell malloc to trim 100MB at a time to prevent kernel destroying the cpu by defraging/compacting memory. fix core in Title.cpp. Matt Wells 2016-04-17 09:16:03 -07:00
  • 95a3a261db fix so host 8 doesn't jam things up so much. host 8 was too busy merging a large spiderdb for blouartinfo and unable to tend to other smaller merges and therefore the # of files was getting out of hand causing slowdowns. so merge spiderdb much less aggressively. Matt 2016-04-15 13:34:41 -06:00
  • 165f724fd7 thanks to isj for the puny code fixes Matt 2016-04-06 11:10:42 -06:00
  • 74cfde3e53 fix calling doneSendingNotification() with a just-freed memory ptr bug. Matt 2016-04-06 10:53:08 -06:00
  • d2983747b8 fix sending back reply that has some stuff on the stack that it references when XmDoc::getMsg20Reply() returns. thanks to isj for this fix. Matt 2016-04-06 10:43:43 -06:00
  • 8891100c2a fix add url on root page to set collnum properly. fix Summary::getBestWindow() underrun bug. Matt 2016-04-06 10:31:04 -06:00
  • 70ca2fe48c update ./gb -h desc for ./gb inject. Matt 2016-04-05 21:06:38 -06:00
  • f5d0045b43 Merge pull request #82 from vonbetz/testing Gigablast 2016-03-29 13:12:56 -06:00
  • 3c140b87aa Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2016-03-29 12:42:05 -06:00
  • cf7ec13de6 Fix international domain printing bug. Zak Betz 2016-03-29 12:41:34 -06:00
  • 33e76af1d1 Merge branch 'testing' Matt 2016-03-29 04:11:30 -06:00
  • 816d69b34c a lot of bug fixes thanks to isj. Matt 2016-03-29 04:08:17 -06:00
  • 5072e851b7 fix misspelling Matt 2016-03-28 17:26:40 -06:00
  • 5935619eb2 hack on parentUrlDocId to the json object dump of diffbot objects. Matt 2016-03-28 12:39:48 -06:00
  • cab6d5c519 fix keysize==8 bug in keycmp Matt 2016-03-28 09:17:01 -06:00
  • b65a16caee Merge branch 'diffbot-testing' into testing Matt 2016-03-22 16:25:21 -06:00
  • 3c743a7d0e allow more docids to be downloaded/served in search results. Matt Wells 2016-03-22 15:24:33 -07:00
  • 04a8433256 show gbssParentDocId in status doc for children docs, like diffbot object docs. Matt Wells 2016-03-22 09:00:10 -07:00
  • 483d69d7f7 added httprequest debug line Matt Wells 2016-03-21 14:46:10 -07:00
  • 136d23816c fix hashbang properly Matt Wells 2016-03-21 09:29:55 -07:00
  • 48398d0cd7 Merge branch 'diffbot-testing' into testing Matt 2016-03-20 23:14:26 -06:00
  • 136b8842db fix more data corruption bugs. hopefully will dump out all the collections this time and not leave any in the tree, otherwise, especially if there are a lot left behind, they get corrupted. Matt Wells 2016-03-20 21:04:01 -07:00
  • 61ef806dea hash bang fix. detect more corruption. don't dump titledb and spiderdb at same time, seems to reduce corruption in rdbmem. Matt Wells 2016-03-20 12:50:43 -07:00
  • fc495a5bf5 fix dump core when collection deleted while dumping Matt Wells 2016-03-18 06:46:38 -07:00
  • 8922b8e69c Merge branch 'diffbot-testing' into testing Matt 2016-03-17 14:31:22 -06:00
  • 56bde4c3ef fix the data corruption fix Matt Wells 2016-03-17 13:22:56 -07:00
  • 8bc653c31c after dump completes scan tree to ensure all nodes reference secondary mem ptr so they don't get their data overwritten. Matt Wells 2016-03-17 10:09:49 -07:00
  • 0caf345850 if running ./gb start and another gb is already bound on the port then quickly exit(0) and have the bash keep alive loop exit the loop based on that return value. we can't use ./cleanexit file because it doesn't get remove and will mess up the main process that is running. Matt Wells 2016-03-16 16:56:48 -07:00
  • 36fdbf2f5a rename log files in the gb main.cpp code not in the bash loop. do not rename the log file if failed to start gb because socket was already bound. prevents us from double starts moving the log file, which is annoying. Matt Wells 2016-03-16 16:08:08 -07:00
  • a2e8a3a1fd use ./cleanexit file to ensure gb doesn't restart after a graceful exit in the bash keep alive loop. Matt Wells 2016-03-16 14:57:19 -07:00
  • 7396e57660 show docids of corrupted title recs found. show key range of each dump to disk. fix 'sentToDiffbot' bug for unchanged docs in status docs. make sure firstKeyInQueue is set properly from current key, so reset list ptr before doing that in RdbDump.cpp. Matt Wells 2016-03-16 13:53:08 -07:00
  • 5e8c47adfd Merge branch 'diffbot-testing' into testing Matt 2016-03-16 01:14:37 -06:00
  • 1faff50f5a if msg22a never called to get docid, then error out. Matt Wells 2016-03-16 00:14:02 -07:00
  • c7c8c9e5ad Merge branch 'diffbot-testing' into testing Matt 2016-03-16 00:54:49 -06:00
  • 0b5f417349 if old title rec was corrupted we would get a random docid when re-spidering the url causing some chaos. now things should return to normal and we should overwrite the corrupted titlerec on the next spidering. also, no longer do robots.txt titlerec lookups. silly. Matt Wells 2016-03-15 23:26:57 -07:00
  • 58993dbbf9 do not allow crawlbot seeds to be deduped out Matt Wells 2016-03-15 20:42:28 -07:00
  • bf45db6f48 Merge branch 'diffbot-testing' into testing Matt Wells 2016-03-15 15:55:55 -07:00
  • 8a65d21371 fix the source of lots of corruption in spiderdb and titledb. rdbmem.cpp was storing in secondary mem which got reset when dump completed. also do not add keys that are in collnum and key range of list currently being dumped, return ETRYAGAIN. added verify writes parm. clean out tree of titledb and spiderdb corruption on startup. Matt Wells 2016-03-15 15:54:12 -07:00
  • 0fdbaa4196 makefile optimizations Matt Wells 2016-03-14 16:34:24 -07:00
  • 0dbc304bbf fix to allow us to gather ip-only url outlinks again Matt 2016-03-14 10:56:33 -06:00
  • 2c167aada7 fix redirect to self bug that requires setting cookie Matt 2016-03-14 10:33:05 -06:00
  • d6fe684b99 fix another core caused by deleted coll Matt Wells 2016-03-07 10:20:25 -08:00
  • d4e16a4dab pass a crawlbotnightly smoke Matt Wells 2016-03-04 13:14:28 -08:00
  • e75d80abbe ignore meta redirect tags in html comment tags. Matt Wells 2016-02-22 12:41:03 -08:00
  • 412b04bbd4 fix neverending crawl rounds by only trying each url once per round. updated url filters. Matt Wells 2016-02-22 09:28:46 -08:00
  • da9949f462 try to fix a couple more core dumps. Matt Wells 2016-02-19 08:54:48 -08:00
  • c7696a69eb fix core from a federated query and null msg20 Matt Wells 2016-02-18 10:53:20 -08:00
  • f649944573 if spidered time is in future, consider the spiderreply corrupt and ignore it. if you set back the OS clock then you might end up ignoring some spider replies but hopefully it won't be such a big deal. Matt Wells 2016-02-16 12:25:49 -08:00
  • 8748cd06ac Merge pull request #73 from AppChecker/master Gigablast 2016-02-13 23:00:37 -07:00
  • f11595efc3 fix core dump from deleting an active/dumping collection Matt Wells 2016-02-12 16:54:03 -08:00
  • bf4bdd6bfd Merge branch 'diffbot-testing' into testing Matt Wells 2016-02-10 09:50:53 -08:00
  • e68406f073 fix core in posdbtable from docid of 0. no idea why docid was 0, but why core? Matt Wells 2016-02-09 22:43:09 -08:00
  • e376b97814 let's generalize it. if a redirect sets cookies then follow it through, don't stop in the middle because we think it is 'simplified'. Matt Wells 2016-02-09 13:47:12 -08:00