Commit Graph

  • f78be592d4 Merge branch 'master' of https://github.com/privacore/open-source-search-engine Brian Rasmusson 2015-12-18 12:06:08 +01:00
  • 784a38b198 Added logging and hard coded error checks to BigFile.cpp Brian Rasmusson 2015-12-18 12:05:23 +01:00
  • 131780e640 Remove commented out code Ai Lin Chia 2015-12-18 11:56:10 +01:00
  • 4122e5e63b Remove unused variable in Log.cpp Ai Lin Chia 2015-12-18 11:48:09 +01:00
  • 8099278f88 Add test for Process::shutdownAbort Ai Lin Chia 2015-12-18 11:47:42 +01:00
  • f5d4f0bf26 Removed some commented out code Ai Lin Chia 2015-12-18 10:50:32 +01:00
  • d3aca67c31 Add Process::abort method which will write fatal_error file that prevents gb from starting again Ai Lin Chia 2015-12-18 10:48:00 +01:00
  • 616e3f6da3 Remove double semicolon Ai Lin Chia 2015-12-18 01:09:47 +01:00
  • f4693a011b Always log error Ai Lin Chia 2015-12-17 22:27:51 +01:00
  • ccf3a90522 fix spider req corruption Matt Wells 2015-12-16 20:33:45 -08:00
  • b0c184bfd1 add new link to page crawlbot to see spider attempt gbss docs. Matt Wells 2015-12-15 16:22:53 -08:00
  • 22c33b22c5 improve spiderreq corruption detection Matt Wells 2015-12-15 14:23:21 -08:00
  • c9c7409f05 fix for all urls getting malformed url (EBADURL) while spidering. had to add 'errorcode==' to urlfilters to just redo those pages otherwise having that error, a non-temporary error, would have barred them from being retried in the future. Matt Wells 2015-12-15 10:06:06 -08:00
  • 94544893a0 Added hexdump log option and started to add logging to BigFile.cpp Brian Rasmusson 2015-12-17 23:17:09 +01:00
  • e07ba907be Log the log level and thread id Brian Rasmusson 2015-12-17 20:03:09 +01:00
  • ac35858f66 Don't allow gb to start when fatal_error file exists Ai Lin Chia 2015-12-17 16:05:43 +01:00
  • 766a6eeb83 Removed un-implemented decl BigFile::addSig() Ivan Skytte Jørgensen 2015-12-17 15:51:06 +01:00
  • 404badc795 Remove hard-coded IP-addresses Ivan Skytte Jørgensen 2015-12-15 16:55:03 +01:00
  • 1540ce0754 Use #defines for DNS servers Ivan Skytte Jørgensen 2015-12-15 15:09:11 +01:00
  • dae1de7d0e more fixes for ebadurl bug Matt Wells 2015-12-14 17:49:06 -08:00
  • 10a2b7a104 have diffbot retry non tmp errors to make up for bug of calling valid urls malformed EBADURLs Matt Wells 2015-12-14 17:30:41 -08:00
  • 3cfe03db43 fix a fix Matt Wells 2015-12-14 17:06:58 -08:00
  • d75e979756 use small dgrams to avoid splitting at the kernel level down to the mtu. increase 0xc1 msg request delay from 3 to 20 secs. need to make it linear order. Matt Wells 2015-12-14 10:47:07 -08:00
  • 887c6697ac Merge conflict c161f1719627961026107ba80b7c42e8ba66321f Matt Wells 2015-12-12 19:18:28 -08:00
  • 4934afd3a7 do not dedup if &links is in diffbot api url (or ?links) Matt Wells 2015-12-09 16:52:11 -08:00
  • 04e67aeb56 fix for www.gov.uk having iswwwdup bug. because .gov.uk is a tld and so is .uk. also added code to handle seg fault signals better and run the default handler after saving rather than calling abort(). hopefully a core will be dumped all the time now. Matt Wells 2015-12-09 15:54:53 -08:00
  • 920fe420cb added some more quickpolls. improved heartbeat log msg. timed pthread_join. brought back max heart beat delay parm. Matt Wells 2015-12-04 09:02:03 -08:00
  • 3103144f2b do not do spider-time deduping if &links is in the diffbot api url. Matt 2015-12-02 13:09:42 -07:00
  • 05ef4e3493 More merges from gigablast Matt Wells 2015-12-01 09:03:59 -08:00
  • 5a7053fb46 fix truncation of search results some more hopefully Matt Wells 2015-11-30 16:33:03 -08:00
  • 10dd8b5cc3 fix a couple of cores happening on crawlbot. fix bug of a urls.csv or other streaming download being truncated because gb thinks a shard is down. even if it is down, wait for it to come back up. Matt Wells 2015-11-30 13:26:43 -08:00
  • e0bb3974d1 fix urgent merge mode bug some more? limit spiders to 5 per custom crawl coll per shard. Matt Wells 2015-11-24 08:51:18 -08:00
  • 6c516ad86e fix spider proxy table bug that seemed to be the reason for the table getting so full. but in case it does get full again added a call to hashtablex::empty() so we don't freeze up any more. Matt Wells 2015-11-21 10:43:23 -08:00
  • da7c166ca2 tune spider proxy table flushing logic a bit Matt Wells 2015-11-21 10:29:02 -08:00
  • 935f3f050c try to fix the proxy load balancing table logic some more. seems to not cleanup after itself very well. Matt Wells 2015-11-21 10:20:20 -08:00
  • cd0a2dbe19 fix spider proxy clean up algo a little so it won't freeze up Matt Wells 2015-11-11 08:27:09 -08:00
  • 185dc25631 fix bug of dumping too many files to disk and not being able to merge, and corrupting RdbBase::m_files[] array and associated arrays. Matt Wells 2015-11-17 09:52:41 -08:00
  • f10589ed4e Merge branch 'master' of https://github.com/privacore/open-source-search-engine Brian Rasmusson 2015-12-15 14:12:48 +01:00
  • a38d8da0f0 Do not attempt to spider a site if getting robots.txt returns errors other than 404. Brian Rasmusson 2015-12-15 14:12:06 +01:00
  • 04179efc10 Forgotten commit, and git doesn't like me Ivan Skytte Jørgensen 2015-12-15 14:06:38 +01:00
  • 94b2b6b0d3 Use const in iptoa() parameters Ivan Skytte Jørgensen 2015-12-15 13:41:54 +01:00
  • 37b639f7f5 bugfix: use correct loop condition when initializing Conf::m_dns* Ivan Skytte Jørgensen 2015-12-15 13:36:39 +01:00
  • d2c6d8710f Removed unused member Multicast::m_timeoutPerHost Ivan Skytte Jørgensen 2015-12-15 12:02:54 +01:00
  • bdaaaacd23 Bugfix for robots.txt handling on https sites. Removed some dead code. Brian Rasmusson 2015-12-15 12:46:29 +01:00
  • 4a4a264058 Bugfix for robots.txt handling on https sites. Removed some dead code. Brian Rasmusson 2015-12-15 12:37:07 +01:00
  • b1e8e4b854 Multicast cleanup Ivan Skytte Jørgensen 2015-12-14 15:41:31 +01:00
  • 1bafa03fa9 Bugfix too small array Multicast::m_hostPtrs Ivan Skytte Jørgensen 2015-12-14 15:06:28 +01:00
  • 5205ea0bba Removed global g_hostdb2 Ivan Skytte Jørgensen 2015-12-14 12:41:51 +01:00
  • 6a9defd2b0 Cleanup in Multicast.* Ivan Skytte Jørgensen 2015-12-14 12:27:56 +01:00
  • 7ded248d04 Removed more unused and write-only members from Multicast Ivan Skytte Jørgensen 2015-12-14 11:44:37 +01:00
  • f8e9dac6de Removed AutoBan, dmoz & related code (Categories and related Msg/Db), scraping code. Ai Lin Chia 2015-12-11 19:07:14 +01:00
  • 58bd02d34a Switched google test submodule to .git url Ai Lin Chia 2015-12-11 21:51:49 +01:00
  • 2e805b99d3 Removed retry-forever member and logic from Multicast.* Ivan Skytte Jørgensen 2015-12-11 16:19:36 +01:00
  • 1df3a2edf1 Use /etc/hosts by default Ivan Skytte Jørgensen 2015-12-11 15:47:02 +01:00
  • fd1a6db738 Don't add hidden input fields for checkboxes Ivan Skytte Jørgensen 2015-12-11 14:38:51 +01:00
  • a747f14bb7 Removed yippy-specific function gotTcpReplyWrapper() which is now unused Ivan Skytte Jørgensen 2015-12-11 11:48:49 +01:00
  • e230c55e38 Removed unused functions from Profiler.cpp Ivan Skytte Jørgensen 2015-12-11 11:47:48 +01:00
  • cf218d02cf Removed logic and variables dealing with yippy Ivan Skytte Jørgensen 2015-12-11 11:45:07 +01:00
  • 983567b180 Don't initialize array in constructor initializer list Ai Lin Chia 2015-12-11 11:36:31 +01:00
  • 8665972f51 Use /usr/bin/env instead of /bin/env which doesn't exist on ubuntu Ai Lin Chia 2015-12-10 22:18:01 +01:00
  • 54d9d28ffd Fixed typo in gbconvert.sh Ai Lin Chia 2015-12-10 16:48:34 +01:00
  • 00d4c2a124 Fix make dist target. We should copy the new gbconvert.sh & gbcheck.sh files as well Ai Lin Chia 2015-12-10 11:03:11 +01:00
  • 4286fe3bc3 More cleanup & formatting changes Ai Lin Chia 2015-12-10 10:59:26 +01:00
  • d7b3e5b3fa Remove openssl headers. We're using system headers anyway. Ai Lin Chia 2015-12-09 14:31:48 +01:00
  • 0b98f2c337 Remove some unused methods/class. Minor restructuring of test files. Ai Lin Chia 2015-12-09 11:25:18 +01:00
  • 8df3deff5f Removed unused member Multicast::m_startTimeMS Ivan Skytte Jørgensen 2015-12-10 11:55:59 +01:00
  • 655b4eaf97 Ignore pdf HTTP content-type encoding because pdftohtml will always convert to UTF-8 Ai Lin Chia 2015-12-08 16:09:48 +01:00
  • 9c7ddd1382 Removed global 'g_udpServer2' Ivan Skytte Jørgensen 2015-12-08 14:44:02 +01:00
  • 8ea76f7b7f Removed 'realtime' parameter from Multicast::send() Ivan Skytte Jørgensen 2015-12-08 14:32:36 +01:00
  • 7e95cb1c52 Added const qualifier to parameters in hash.* Ivan Skytte Jørgensen 2015-12-08 12:09:35 +01:00
  • 4213099332 Remove pdftohtml from make dist target Ai Lin Chia 2015-12-08 11:30:16 +01:00
  • 93e0bcc97f Use system's pdftohtml. Verify that system's pdftohtml exists during startup. Ai Lin Chia 2015-12-08 11:21:06 +01:00
  • 7fa34990ae Remove sendPageSEO function. Minor cleanup on main.cpp. Ai Lin Chia 2015-12-08 05:41:31 +01:00
  • 8a616ae773 Remove unused images/html files. Remove pages not necessary for gb to run (except documentation pages). Ai Lin Chia 2015-12-08 05:37:32 +01:00
  • 4c4b99a9cb Change doxygen setting Ai Lin Chia 2015-12-07 23:47:28 +01:00
  • b7632beff9 Modify some doxygen config Ai Lin Chia 2015-12-07 22:01:19 +01:00
  • ffaec1ba3d Remove unused cmp methods Ai Lin Chia 2015-12-07 19:15:29 +01:00
  • e24d5e0dc0 Remove dosopen command Ai Lin Chia 2015-12-07 19:10:39 +01:00
  • 8458415341 Cleanup g_siteBonus and unused Wiki methods Ai Lin Chia 2015-12-07 18:45:33 +01:00
  • 4b7020bf88 Remove commented out DateParse & code from main.cpp Ai Lin Chia 2015-12-07 18:25:05 +01:00
  • eaa948fa20 Remove Scraper logic. It looks to be scraping google with random words and seeding it into gb. Ai Lin Chia 2015-12-07 17:48:44 +01:00
  • 08a757ec00 More minor cleanup on main.cpp Ai Lin Chia 2015-12-07 17:45:12 +01:00
  • d44cabfdb2 fctypes: const + remove unused Ivan Skytte Jørgensen 2015-12-07 19:50:31 +01:00
  • 08986fd341 Removed unused function reverseBits() Ivan Skytte Jørgensen 2015-12-07 18:25:38 +01:00
  • 8a193adf8a Removed hand-coded memcpy(), memset(), etc. Ivan Skytte Jørgensen 2015-12-07 18:09:20 +01:00
  • 99e7c72fe1 Removed unused functions from Mem.* Ivan Skytte Jørgensen 2015-12-07 17:03:18 +01:00
  • ded0087bff Remove tfndb related commented out code from main.cpp Ai Lin Chia 2015-12-07 16:53:56 +01:00
  • a32aff39a6 Fix system test where wrong parameter is passed Ai Lin Chia 2015-12-07 16:11:08 +01:00
  • 43e93e1769 Removed effectively unused Msg20::m_owningParent Ivan Skytte Jørgensen 2015-12-07 16:10:18 +01:00
  • 6976364c16 Removed unused members from Msg20 Ivan Skytte Jørgensen 2015-12-07 16:07:17 +01:00
  • 8b5a5da228 Removed write-only members from Msg20Reply Ivan Skytte Jørgensen 2015-12-07 15:46:14 +01:00
  • 683e52b6cd Removed hardcoded IP-address range check. Ivan Skytte Jørgensen 2015-12-07 15:42:39 +01:00
  • 47201a8739 Removed unused members from Msg20Reply Ivan Skytte Jørgensen 2015-12-07 14:44:20 +01:00
  • 06e238281a Removed unused and write-only members from Msg20Request Ivan Skytte Jørgensen 2015-12-07 14:04:21 +01:00
  • fa996e349d Removed unused #defines from Msg20.h Ivan Skytte Jørgensen 2015-12-07 12:01:57 +01:00
  • 6899cd785d Remove all references to and commented out code for checksum db, which is not used anymore Ai Lin Chia 2015-12-07 15:58:53 +01:00
  • 76d55c5c00 Treat mime-type with 'text/' as text so we'll index them Ai Lin Chia 2015-12-07 12:41:39 +01:00
  • c492f41124 bugfix JSON format for cache_available Ivan Skytte Jørgensen 2015-12-07 11:55:34 +01:00
  • b5c4df1890 Remove escaping solidus '/' for json encode Ai Lin Chia 2015-12-07 11:32:31 +01:00
  • b0f9df146f Send back field 'cacheAvailable' in JSON to tell the frontend if a cached version is available Ivan Skytte Jørgensen 2015-12-07 11:31:00 +01:00