Commit Graph

  • bebf9a1c01 Moved globals g_pid and g_filterTimeout from XmlDoc to locals in Threads.cpp Ivan Skytte Jørgensen 2016-01-26 12:03:15 +01:00
  • ce1501dda7 Moved global g_ticker from XmlDoc to a local in Threads.cpp Ivan Skytte Jørgensen 2016-01-26 12:01:08 +01:00
  • a31aeb917a Removed unused globals g_xd and g_od, and make g_wtab local Ivan Skytte Jørgensen 2016-01-26 11:57:34 +01:00
  • 5a77f7194a Removed unused #define MAXMSG7S Ivan Skytte Jørgensen 2016-01-26 11:48:23 +01:00
  • 98cb6a7b7a Removed unused members XmlDoc::m_xmlDocs and XmlDoc::m_used Ivan Skytte Jørgensen 2016-01-26 11:47:49 +01:00
  • 289ff46c33 Removed unused structure SubSent Ivan Skytte Jørgensen 2016-01-26 11:41:00 +01:00
  • 09ac5856a0 Removed unused functions gbcompress7() and gbuncompress7() Ivan Skytte Jørgensen 2016-01-26 11:39:44 +01:00
  • 46fd32db7e Removed address and placedb functionality Ivan Skytte Jørgensen 2016-01-26 11:11:57 +01:00
  • 09236a7259 More domains added to spidering blacklist Brian Rasmusson 2016-01-26 11:09:36 +01:00
  • 91dbe918f7 fix critical spider issue of an IP corking an entire spider priority. also exit faster on 'save & exit' if in evalIpLoop(). Matt Wells 2016-01-23 10:32:14 -08:00
  • 19351f944e detect more corrupted spider records caused by saving memory to disk after a segv corrupts the memory. Matt Wells 2016-01-21 09:57:43 -08:00
  • 4b7efd45ce improve spiderrequest::isCorrupt() Matt Wells 2016-01-19 22:24:52 -08:00
  • a1d0b3bb70 fix a core from dumping doledb out to stdout Matt Wells 2016-01-11 10:32:47 -08:00
  • 87bcdb9d59 fix strange corruption in doledb core Matt Wells 2016-01-11 09:40:55 -08:00
  • d4dd548a15 fix core dump from bad langid of 99 Matt Wells 2016-01-05 14:32:02 -08:00
  • 1db006c8d9 fix bug of losing the hopcount 0 spiderrequest because it gets overridden by a link to itself then it becomes hopcount 1. Matt 2016-01-05 13:49:54 -07:00
  • f51fd1be1b No longer add non-indexable links to spiderdb Brian Rasmusson 2016-01-25 22:33:46 +01:00
  • 3141874574 valgrind: fix uninitialized fixedDistance -> PairScore::m_fixedDistance -> network buffer Ivan Skytte Jørgensen 2016-01-25 18:35:25 +01:00
  • 97851aed44 how did this even link? Ivan Skytte Jørgensen 2016-01-25 17:33:13 +01:00
  • 0601043a51 valgrind: avoid testing undefined bytes in SpiderColl::m_siteListIsEmpty Ivan Skytte Jørgensen 2016-01-25 17:06:49 +01:00
  • 174eecb70a Moved a few URL filtering functions to the Url class Brian Rasmusson 2016-01-25 16:47:26 +01:00
  • a1e7180724 msg39_reply no longer sends uninitialized bytes Ivan Skytte Jørgensen 2016-01-25 16:40:59 +01:00
  • ae9e33a830 valgrind: PairScore had uninitialized bytes in internal padding Ivan Skytte Jørgensen 2016-01-25 16:40:34 +01:00
  • 99dd8e27d1 valgrind: SingleScore structure had uninitialized padding Ivan Skytte Jørgensen 2016-01-25 16:37:50 +01:00
  • 55ec8ada61 valgrind: check Msg39Reply buffer Ivan Skytte Jørgensen 2016-01-25 16:36:33 +01:00
  • a2a26890a2 valgrind: fix undefined bytes in DocIdScore padding Ivan Skytte Jørgensen 2016-01-25 16:20:15 +01:00
  • 06ac7deed0 valgrind: fix undef bytes in SpiderReply::print() Ivan Skytte Jørgensen 2016-01-25 15:20:32 +01:00
  • 7fb46537f2 more valgrinding Ivan Skytte Jørgensen 2016-01-25 14:30:39 +01:00
  • d333a7ca79 valgrind: ensure titlerec fields have defined bytes Ivan Skytte Jørgensen 2016-01-25 14:03:52 +01:00
  • 9c88d4c5c3 valgrind: undefined bytes originating from Address::m_numNonDupAddresses Ivan Skytte Jørgensen 2016-01-25 13:59:04 +01:00
  • 2905ea2350 Removed old use of _VALGRIND_ macro Ivan Skytte Jørgensen 2016-01-25 12:02:10 +01:00
  • 1b70557444 Added some tracing log to XmlDoc Brian Rasmusson 2016-01-23 21:16:29 +01:00
  • 59fbf75226 Use the high-frequency term shortcut file Ivan Skytte Jørgensen 2016-01-22 16:13:28 +01:00
  • 2ca6015ba7 Fix for site: searches Brian Rasmusson 2016-01-22 15:50:44 +01:00
  • 1b38f7ec2c Detect high-frequency terms based on extern frequency file Ivan Skytte Jørgensen 2016-01-21 17:24:01 +01:00
  • 8941862902 Make Log.h self-contained Ivan Skytte Jørgensen 2016-01-21 16:55:59 +01:00
  • f5ec87640d search for site:dk now works, but should be refined further. On terms page, term hashes are now 48-bit as stored in posdb Brian Rasmusson 2016-01-21 17:22:16 +01:00
  • 7c8205524b Removed unused parameter forceParitySplit from Msg2::getLists() Ivan Skytte Jørgensen 2016-01-21 15:38:29 +01:00
  • 9d9a7d4a29 Removed effectively constant-value parameter restrictPosdb from Msg2::getLists() and associated member m_restrictPosdb Ivan Skytte Jørgensen 2016-01-21 15:25:31 +01:00
  • f1a52d3d15 Removed effectively constant-value Msg39::m_restrictPosdbForQuery (always false) Ivan Skytte Jørgensen 2016-01-21 15:20:27 +01:00
  • 3acc03ff71 Removed outdated comments Ivan Skytte Jørgensen 2016-01-21 14:22:31 +01:00
  • 2480728703 Removed outdated comments and commented-out code Ivan Skytte Jørgensen 2016-01-21 14:21:11 +01:00
  • b0169d8391 Removed outdated comments Ivan Skytte Jørgensen 2016-01-21 12:31:52 +01:00
  • 439b855136 Trying to fix summary best window by playing around with score. Hopefully this will give a better summary for wikipedia and some other sites Ai Lin Chia 2016-01-20 15:57:11 +01:00
  • 2b084cdd23 Fix ellipsis handling when we're less than 4 characters from limit. Fix unit test. Ai Lin Chia 2016-01-20 13:49:30 +01:00
  • 869be5099c Fix todo comments Ai Lin Chia 2016-01-20 13:32:48 +01:00
  • ac8249e07d Improve title/summary for youtube Ai Lin Chia 2016-01-20 13:32:13 +01:00
  • 40af20df8d Tweak min title length Ai Lin Chia 2016-01-20 10:35:01 +01:00
  • 22c4e92023 Remove duplicated unit test Ai Lin Chia 2016-01-19 17:01:57 +01:00
  • 46246af716 Trim ellipsis from title or summary. We'll add it ourselves. Ai Lin Chia 2016-01-19 14:26:35 +01:00
  • a2c69eb127 Tweaking max number of words for summary Ai Lin Chia 2016-01-19 11:42:16 +01:00
  • 077a31f6a4 Removed unused local variable Ivan Skytte Jørgensen 2016-01-19 16:02:17 +01:00
  • 897117b0f2 Minor cleanup of Msg2::getLists() Ivan Skytte Jørgensen 2016-01-19 15:58:52 +01:00
  • 468b4f82a4 Remove diffbot specialities from Images.cpp Ivan Skytte Jørgensen 2016-01-19 15:12:17 +01:00
  • 97e35c3f10 Removed diffbot-reply parameter from Links::set() Ivan Skytte Jørgensen 2016-01-19 13:56:52 +01:00
  • 35817ef817 Fix bug where we're referencing uninitialized buffer Ai Lin Chia 2016-01-18 20:16:06 +01:00
  • d3261c495a Add unit test for Pos::filter. Fix bug in previous commit where an all-caps word would be uncapitalized instead of an all-caps buffer. Ai Lin Chia 2016-01-18 19:09:38 +01:00
  • a50b38c62a Fix unit test Ai Lin Chia 2016-01-18 18:38:47 +01:00
  • cc32fbfd6d Fix line-endings on XmlDoc Ai Lin Chia 2016-01-18 17:29:11 +01:00
  • 0117b2148e When all caps title/summary is encountered, capitalize only start of every 'word'. This is done only for all caps ascii to avoid handling special cases for now. Ai Lin Chia 2016-01-18 17:16:45 +01:00
  • 80488f1444 Fix title logging Ai Lin Chia 2016-01-18 17:15:45 +01:00
  • 0ba74618ed Remove Directory side link Ai Lin Chia 2016-01-17 16:57:08 +01:00
  • 1a5538713b Log name of missing file as error Ai Lin Chia 2016-01-17 16:56:54 +01:00
  • 02770af355 Merge branch 'master' of https://github.com/privacore/open-source-search-engine Brian Rasmusson 2016-01-18 15:53:46 +01:00
  • 8ba2b4df1b Filtering out more stuff before hashing for posdb. Undo XmlDoc.cpp linefeed screw-up Brian Rasmusson 2016-01-18 15:52:54 +01:00
  • 793c26ed35 Removed diffbot hack Ivan Skytte Jørgensen 2016-01-18 15:37:59 +01:00
  • a65db9867a Do not hash num inlinks, tagvector, common web words (www, http) Brian Rasmusson 2016-01-17 23:27:29 +01:00
  • e0fac73695 Got rid of default parameter values in Url.cpp set functions. Added support for removal of common tracking parameters in URLs. Brian Rasmusson 2016-01-17 22:49:02 +01:00
  • aaac641766 No longer hash tag records and links from unwanted domains Brian Rasmusson 2016-01-17 16:48:18 +01:00
  • ffa27dd1ba added svg to list of media extensions Brian Rasmusson 2016-01-15 21:33:18 +01:00
  • c3af74ff33 Removed duplicate extension check Brian Rasmusson 2016-01-15 17:27:55 +01:00
  • 5d5c66ff2f Merge branch 'master' of https://github.com/privacore/open-source-search-engine Brian Rasmusson 2016-01-15 17:23:29 +01:00
  • fe5857a4d9 Do not store hashes of media URLs, script URLs and unwanted domains in posdb Brian Rasmusson 2016-01-15 17:23:24 +01:00
  • 44fc9e9288 Make sure we null terminate string Ai Lin Chia 2016-01-15 16:59:05 +01:00
  • c0b519bfdb We should only use meta tags for HTML documents Ai Lin Chia 2016-01-15 16:25:08 +01:00
  • bed3182988 Use meta tags (og:title & title) & title tag when available for generating title Ai Lin Chia 2016-01-15 15:52:17 +01:00
  • a1968fa4e6 Fix use-after-free in doneSendingNotification() Ivan Skytte Jørgensen 2016-01-15 15:16:42 +01:00
  • 012458a9d4 Remove always false variable Ai Lin Chia 2016-01-15 11:20:11 +01:00
  • dda9db226f Make some Xml member variable private Ai Lin Chia 2016-01-15 11:11:38 +01:00
  • 1eb206fffa Fix calculation of end XML node. Ignore itemprop description if it's too long. Remove commented-out code. Ai Lin Chia 2016-01-14 17:12:52 +01:00
  • 55c6854da7 Remove unused variable Ai Lin Chia 2016-01-14 13:52:33 +01:00
  • f3291523fa Remove user info from Proxy Ai Lin Chia 2016-01-14 11:08:50 +01:00
  • 1a28e8911c Remove unused Monitordb code Ai Lin Chia 2016-01-13 17:32:46 +01:00
  • 3a8d287590 Remove commented out code in Tagdb Ai Lin Chia 2016-01-13 16:40:00 +01:00
  • 7177874617 Did not store server port and URL scheme correctly in posdb Brian Rasmusson 2016-01-13 15:55:11 +01:00
  • 1c94c8c065 Skip getting meta tags from inside gbframe (expanded iframe) Ai Lin Chia 2016-01-13 13:26:37 +01:00
  • abbf864b3d Use TAG_* enum instead of hardcoding number Ai Lin Chia 2016-01-13 13:25:34 +01:00
  • bd8e53ea39 Don't use meta tags/itemprop description for summary when it's less than 1/3 of what's needed Ai Lin Chia 2016-01-13 11:53:19 +01:00
  • 63d2ba1a76 Remove adding of offerPrice to title for json documents. Don't use title from A tag when it contains "share" in it. Ai Lin Chia 2016-01-13 11:52:10 +01:00
  • 692a69c01d Remove more not needed files Ai Lin Chia 2016-01-13 10:48:41 +01:00
  • 974f5da3e6 Remove unused TuringTest class Ai Lin Chia 2016-01-13 10:34:49 +01:00
  • a6e8c1b060 Remove unneeded files Ai Lin Chia 2016-01-13 10:24:24 +01:00
  • 11981b0edd Also clean cache files when doing a make cleandb Brian Rasmusson 2016-01-12 16:08:05 +01:00
  • 9ae607ecf6 Try to get a nicer summary by using what the website set as description. Use the following in priority order (highest first): itemprop = "description"; meta name = "og:description"; meta name = "description". Ai Lin Chia 2016-01-12 15:33:42 +01:00
  • 88333cf598 More valgrind suppressions Ivan Skytte Jørgensen 2016-01-12 15:45:32 +01:00
  • 42706b5a41 Don't traverse backward through array past its beginning Ivan Skytte Jørgensen 2016-01-12 15:37:49 +01:00
  • b18e363e3e valgrind: read memory below stack Ivan Skytte Jørgensen 2016-01-12 15:01:14 +01:00
  • 8960ab655e Made botname checked in metatags and robots.txt configurable Brian Rasmusson 2016-01-12 14:16:04 +01:00
  • 16a7796ae8 Made botname checked in metatags and robots.txt configurable Brian Rasmusson 2016-01-12 14:14:02 +01:00
  • c6cccb77db m_nodes[] were not always assigned values causing valgrind to (rightfully) complain Ivan Skytte Jørgensen 2016-01-12 13:52:05 +01:00