Commit Graph

  • d0be9f68a7 fix iswww directive mwells 2015-03-18 09:24:50 -0600
  • 6a1875b619 inline doc update mwells 2015-03-17 21:50:10 -0600
  • d5560c3e77 final fix for new delete column mwells 2015-03-17 21:47:31 -0600
  • c9b14b1b89 fix 'delete' checkbox in url filters. fix reading in of xml conf files that have </> tags. Matt 2015-03-17 21:20:27 -0600
  • dfc069aaa1 do away with filtered/banned spider priorities. add checkbox to signify force deletes to remove urls from index if in the index, or not allow them in. mwells 2015-03-17 20:27:23 -0600
  • dea534827e langidbits init bug leftover from searchinput reset memset fix i think. Matt 2015-03-17 15:04:31 -0600
  • 5b9aa0b0a5 added isroot url filter. mwells 2015-03-17 14:52:04 -0600
  • ebaaaeeef3 Merge branch 'testing' into diffbot-testing mwells 2015-03-17 14:41:33 -0600
  • 384761d4b5 fix build mwells 2015-03-17 14:40:43 -0600
  • f830eb43f7 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-03-17 14:31:33 -0600
  • a54471849b sitemap.xml support for harvesting loc urls. parse xml docs as pure xml again but set nodeid to TAG_LINK etc. so Linkdb.cpp can get links again. added isparentsitemap url filter to prioritize urls from sitemaps. added isrssext to url filters to prioritize new possible rss feed urls. added numinlinks to url filters to prioritize popular urls for spidering. use those filters in default web filter set. fix filters that delete urls from the index using the 'DELETE' priority. they weren't getting deleted. Matt 2015-03-17 14:26:16 -0600
  • 29b2707ad7 fix searchinput::clear() bug. final fix for fhtqt memleak bug. Matt Wells 2015-03-15 07:48:29 -0700
  • 3b39b1d37a fix facet mem leak from QueryTerm::m_facetHashTable and safebuf when doing federated queries over a token. Matt Wells 2015-03-15 07:18:32 -0700
  • 427fae7135 fix log spam Matt Wells 2015-03-12 22:31:40 -0700
  • e71af6d26c recompute active list every 3 secs. otherwise it seems buggy and drops collections it shouldn't. Matt Wells 2015-03-12 22:22:19 -0700
  • 83be5d7d46 fix links parser so it harvests outlinks from rss feeds' <link> tags. it was doing this before, now it is doing it again. Matt 2015-03-12 17:35:47 -0700
  • 48435c55b1 Revert "fix json search results formatting." Matt Wells 2015-03-12 15:42:20 -0700
  • 7879537ab6 fix json search results formatting. Matt Wells 2015-03-12 14:25:37 -0700
  • 5d2a9d6d8c Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-03-12 14:00:03 -0700
  • 89b61d95a8 fix fix Matt Wells 2015-03-12 13:59:42 -0700
  • 3c2b082540 gbfacetstr: is case-sensitive. Matt Wells 2015-03-12 13:54:11 -0700
  • a8dfa56098 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-03-12 13:19:16 -0700
  • 485c600c7c fix for changing maxtocrawl/process/rounds. fix for thinking a crawl is done when it is just taking a while to populate doledb from the waiting tree for that SpiderColl. we just call populateDoledbFromWaiting in doneSleepingWrapper every 50ms. it loops over every coll so it could be more efficient. Matt Wells 2015-03-12 13:15:52 -0700
  • a4e95899bd makefile updates for building pkgs Matt 2015-03-10 21:43:37 -0700
  • 1603201b3e Merge branch 'testing' Matt 2015-03-10 21:02:08 -0700
  • 8e72d6e4cc fix a couple critical xml parsing bugs. fixes parsing of rss feeds better and xml in general. fixed qa tests to ignore collection list when doing diff. Matt 2015-03-10 19:13:21 -0700
  • cdb1c28705 fix page parser core Matt 2015-03-10 16:49:45 -0700
  • 052f2b2009 fix for bug #2092 commented out to avoid qa testing for now. Matt 2015-03-10 15:45:33 -0700
  • c6fd5571d2 if "links":[ is specified in diffbot reply then crawlbot will parse out those links as if they were on the page. Matt 2015-03-10 14:36:44 -0700
  • ef66e1e69f speed up overflow check for firstip with a little cache thingy. reduce log spam from ssl msgs. Matt 2015-03-10 13:24:03 -0700
  • 7cf549bf2a fix spider request overflow/dropping algo. Matt 2015-03-10 13:07:00 -0700
  • 3efb5bd310 fix oopsie Matt Wells 2015-03-09 21:25:23 -0700
  • e346a14a47 added logic to retry diffbot reply on connection reset, connection timed out or gateway timed out (http status 504) msgs. added logic to detect truncated json (missing final }) and not print it. also, at index time, we set a diffbot missing curly error to g_errno so the whole url can be retried later. Matt Wells 2015-03-09 20:54:34 -0700
  • b4419340f7 Merge branch 'testing' Matt Wells 2015-03-08 21:00:28 -0700
  • eccb969e5b put in some fixes to deal with doledb tree that seems to have m_data[i] and m_data[j] pointing to the same thing. wtf? anyway, deal with that. it should fix the tree or something automatically at startup? Matt Wells 2015-03-08 20:36:13 -0700
  • 5e651997e7 fixed langid based query stop words. Matt 2015-03-08 15:44:23 -0700
  • 2413a9b9b1 query stop words now based on selected langid. Matt 2015-03-08 15:16:24 -0700
  • a25c358817 fix core dump caused by some corruption of some sort. for deleting json objs associated with a particular url. Matt Wells 2015-03-08 12:18:47 -0700
  • 5d0b283f6f Merge branch 'testing' Matt 2015-03-08 09:29:27 -0700
  • e368fa4fdc fix so when shutting down merges are suspended. disable threads earlier too. Matt Wells 2015-03-08 09:21:57 -0700
  • 236c4988aa ignore late msg20 reply if we already destroyed the state. a temp fix of a core for now. Matt Wells 2015-03-08 09:11:13 -0700
  • 1d0fd3d7d2 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-03-08 08:56:32 -0700
  • 79879976fa try to fix a couple cores. one when parsing bad json. the other in reclaiming doledb tree mem. Matt Wells 2015-03-08 08:56:10 -0700
  • 0f863e689b Merge branch 'testing' Matt 2015-03-07 20:21:01 -0800
  • 1cac2e282e fix core from adding a lot of sites to the sitelist. mwells 2015-03-07 20:57:17 -0700
  • 2f61f3b48c Merge branch 'testing' Matt 2015-03-07 16:29:37 -0800
  • 0752b85911 fix for infinite loop hang. Matt Wells 2015-03-07 16:28:53 -0800
  • 521bcf01a2 Merge branch 'testing' Matt 2015-03-07 15:52:41 -0800
  • b593027ff6 try to fix default query lang getting reset to nothing in the search controls. Matt 2015-03-07 15:39:04 -0800
  • 5fa01e285d dont show add url and widgets tabs in serps Matt 2015-03-07 15:32:35 -0800
  • 2839c38dac warc injection fixes Matt 2015-03-07 15:01:47 -0800
  • 605a384449 Merge branch 'testing' Matt 2015-03-07 14:57:48 -0800
  • 91af815558 fix missing dir/adv/addurl tabs Matt 2015-03-07 14:57:22 -0800
  • f887019b28 Merge branch 'testing' Matt Wells 2015-03-07 11:11:48 -0800
  • c6a59d0810 fix rdbcache corruption bugs for winnerlistcache. Matt Wells 2015-03-07 11:09:06 -0800
  • 213e430c31 Merge branch 'testing' Matt 2015-03-06 21:34:21 -0800
  • 102f2c1ea0 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-03-06 21:33:56 -0800
  • 5be4dc3c28 fix bug in overflow logic Matt 2015-03-06 21:33:40 -0800
  • 9cd39bb486 Merge branch 'testing' Matt 2015-03-06 20:37:56 -0800
  • 7d135219ed fix dmozparse compiler error Matt 2015-03-06 20:37:30 -0800
  • ed3c5dbc81 fix core from making a url too long after appending -diffbotxyz4000000000 to it Matt Wells 2015-03-06 19:17:16 -0800
  • 501382384d typo fix Matt Wells 2015-03-06 09:47:40 -0800
  • 1f8719b0cd Merge branch 'testing' Matt 2015-03-06 09:40:44 -0800
  • e0dd5720ac try to fix fdisset for writes on udpserver when sendto() would block... Matt Wells 2015-03-06 09:30:36 -0800
  • 4723d9eefa Merge branch 'testing' Matt Wells 2015-03-05 20:34:32 -0800
  • 4506b93ed3 fixed missing sites in sitelinks.txt Matt Wells 2015-03-05 20:32:01 -0800
  • 56d65a7c55 adjust dump tagdb cmdline cmd to start at a specified site to aid us in fixing sitelinks.txt missing some sites bug. Matt Wells 2015-03-05 19:27:36 -0800
  • ccd85ada31 do not give proxy for diffbot to use just yet. need to fix https CONNECT support to download the whole page first and then send that back. need to tell diffbot phantomjs to not worry about certificates then. Matt 2015-03-05 15:32:57 -0800
  • 367c4d5783 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-03-05 11:10:54 -0800
  • 95e3a760e9 proxy fixes Matt 2015-03-05 11:10:40 -0800
  • ad2e1f11fa Merge branch 'testing' Matt 2015-03-05 09:18:28 -0800
  • 56aaad3c9f do not double devance spider priority for a coll Matt 2015-03-05 09:11:36 -0800
  • 6bb9da7532 Merge branch 'testing' Matt 2015-03-05 08:54:03 -0800
  • dfd6d8b2cf fix critical spider bug that was deleting pages because of bogus SpiderReply::m_langId values! Matt 2015-03-05 08:49:39 -0800
  • 7ae5518c1d if no spider launched for a collection, decrement priority and try again. if we hit -1 priority then advance to next coll in the linked list. reset priority to max at start of each collection's round. Matt 2015-03-05 08:15:50 -0800
  • e8e5f9e005 qa test fixes Matt 2015-03-05 07:45:28 -0800
  • 459c65100c fix querying disabled bug. Matt Wells 2015-03-05 07:07:30 -0800
  • 0eafc68a13 debug msg helper Matt 2015-03-04 12:45:06 -0800
  • 707497ae1d fix new cgi parm names Matt Wells 2015-03-04 12:06:54 -0800
  • e768054bca typo Matt Wells 2015-03-04 10:50:45 -0800
  • 38caa517f2 add switches to disable injections or querying from the master controls, for all collections. Matt Wells 2015-03-04 10:49:37 -0800
  • 3fc9abe222 do not show any non-public page if allow cloud users is false and no permission. return 403. Matt Wells 2015-03-02 08:13:38 -0800
  • 93b505e7bb fix isCollAdmin() function to return false if not using coll passwords. they'll have to be master admin. Matt Wells 2015-03-02 07:47:05 -0800
  • 051a8f0ad6 dont attempt merge in quickpoll. just return do not core. Matt Wells 2015-03-02 07:26:38 -0800
  • a97f1fcad0 add skippedshards and totalshards to search results in xml/json so you know if you are missing results from a dead shard or not. all hosts must be dead in shard for shard to be dead. mwells 2015-02-27 08:17:32 -0700
  • f8db6288ae ignore dead shards when doing queries so they remain fast. mwells 2015-02-27 08:02:19 -0700
  • f5383d98db if a shard is dead skip it when searching. Matt 2015-02-27 07:28:41 -0700
  • bb5e0c9c63 another MAX_DGRAMS fix. Matt 2015-02-27 06:30:17 -0700
  • ee672bb3a3 say 'Injected url already indexed' and not 'Injection abandoned' for clarity's sake. Matt Wells 2015-02-26 16:04:51 -0700
  • 064d022d6f call mkdir on 'gb install' cmd. Matt 2015-02-25 19:49:36 -0700
  • d4f67285ce keep spiders maxed out all the time. mwells 2015-02-25 18:40:50 -0700
  • 5ae3ae60d3 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing mwells 2015-02-25 11:40:00 -0700
  • 4355665fa9 fix ssl handshake hanging forever. time it out now. spider performance fix for when a firstip has massive # of urls and can only have one spidered in the future. mwells 2015-02-25 11:39:13 -0700
  • 3f753f8114 sometimes tree data is NULL for some reason... fix core. Matt 2015-02-24 20:34:41 -0700
  • a5e66da609 fix crawls where the url crawl pattern and page process pattern are supplied, but not the url process pattern. such was the dribbble crawl. Matt Wells 2015-02-24 14:27:38 -0700
  • f41877c2e3 compute MIN hop count over all spider requests for the same url in Spider.cpp. Matt Wells 2015-02-24 13:41:18 -0700
  • 9436649531 Merge branch 'testing' Matt 2015-02-22 13:15:57 -0700
  • 692c2932e8 fixed bug of gb not saving mwells 2015-02-22 13:11:20 -0700
  • 4e485b6649 increase dolebuf cache time from 2 to 5 mins for better performance. cache empty dolebufs if winner tree list was not from cache, so in case we have a huge spiderdb scan list of urls we aren't spidering we can cache it, like twitter.com e.g. do not call strstr in getUrlFilterNum2() for .css? or /print/ since it was taking way too much cpu time. mwells 2015-02-21 15:17:28 -0700
  • a57de3289c Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-02-21 09:43:33 -0800